rethinkdb / rethinkdb

The open-source database for the realtime web.
https://rethinkdb.com

Crash during `delete()`: Guarantee failed: [res == archive_result_t::SUCCESS] #4769

Closed soplakanets closed 9 years ago

soplakanets commented 9 years ago

After issuing the following query, `r.db('db').table('table').delete()`, the RethinkDB server crashed almost instantly.

Log file records:

2015-08-26T07:41:36.204805696 11.762660s error: Error in src/rdb_protocol/lazy_json.cc at line 22:
2015-08-26T07:41:36.205065318 11.762920s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2015-08-26T07:41:36.205159079 11.763014s error: Backtrace:
2015-08-26T07:41:36.332349898 11.890208s error: Wed Aug 26 07:41:36 2015\n\n1: backtrace_t::backtrace_t() at ??:?\n2: format_backtrace(bool) at ??:?\n3: report_fatal_error(char const*, int, char const*, ...) at ??:?\n4: get_data(rdb_value_t const*, buf_parent_t) at ??:?\n5: lazy_json_t::get() const at ??:?\n6: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at ??:?\n7: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at ??:?\n8: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at ??:?\n9: coro_t::run() at ??:?
2015-08-26T07:41:36.332649646 11.890504s error: Exiting.

Single server setup. Server info:

$ rethinkdb --version
rethinkdb 2.1.2~0trusty (GCC 4.8.2)
$ uname -a
Linux rdb.local 3.13.0-52-generic #85-Ubuntu SMP Wed Apr 29 16:44:17 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

soplakanets commented 9 years ago

It's worth mentioning that the same issue happened on the previous 2.1.1 release; I upgraded in the hope that it would be fixed in 2.1.2.

soplakanets commented 9 years ago

Well, it looks like this screwed up the table. Now any query on the table crashes the server with the same error.

1lann commented 9 years ago

I would like to mention that this same error happened to me a few days ago while running rethinkdb 2.1.1 (CLANG 4.2 (clang-425.0.28)) on OS X. However, it only happened when I ran rethinkdb dump, and the server would crash with this error:

error: Error in src/rdb_protocol/lazy_json.cc at line 22:
error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
error: Backtrace:
error: Sun Aug 23 21:36:35 2015
       1: 0   rethinkdb                           0x0000000105d1dd2b _Z11deserializeIL17cluster_version_t6EE16archive_result_tP13read_stream_tP17repli_timestamp_t + 539 at 0x105d1dd2b ()
       2: 0   rethinkdb                           0x000000010588192c _ZN26callable_action_instance_tINSt3__16__bindIRFvNS0_8functionIFvvEEEEJRS4_EEEED0Ev + 3724 at 0x10588192c ()
       3: 0   rethinkdb                           0x0000000105882167 _ZN26callable_action_instance_tINSt3__16__bindIRFvNS0_8functionIFvvEEEEJRS4_EEEED0Ev + 5831 at 0x105882167 ()
       4: 0   rethinkdb                           0x000000010588154a _ZN26callable_action_instance_tINSt3__16__bindIRFvNS0_8functionIFvvEEEEJRS4_EEEED0Ev + 2730 at 0x10588154a ()
       5: 0   rethinkdb                           0x0000000105ae67ec _Z11deserializeIL17cluster_version_t6EE16archive_result_tP13read_stream_tP13name_string_t + 8204 at 0x105ae67ec ()
       6: 0   rethinkdb                           0x0000000105c17107 _ZNSt3__16vectorINS0_I15lon_lat_point_tNS_9allocatorIS1_EEEENS2_IS4_EEE21__push_back_slow_pathIS4_EEvOT_ + 1047 at 0x105c17107 ()
       7: 0   rethinkdb                           0x0000000105c17267 _ZNSt3__16vectorINS0_I15lon_lat_point_tNS_9allocatorIS1_EEEENS2_IS4_EEE21__push_back_slow_pathIS4_EEvOT_ + 1399 at 0x105c17267 ()
       8: 0   rethinkdb                           0x0000000105b4325b _ZN2ql11deserializeIL17cluster_version_t6EEE16archive_result_tP13read_stream_tPNS_11batchspec_tE + 16523 at 0x105b4325b ()
       9: 0   rethinkdb                           0x0000000105891e84 _ZN30concurrent_traversal_adapter_t16handle_pair_coroEP18scoped_key_value_tP15semaphore_acq_t27fifo_enforcer_write_token_tN14auto_drainer_t6lock_tE + 116 at 0x105891e84 ()
       10: 0   rethinkdb                           0x0000000105891faa _ZN26callable_action_instance_tINSt3__16__bindIM30concurrent_traversal_adapter_tFvP18scoped_key_value_tP15semaphore_acq_t27fifo_enforcer_write_token_tN14auto_drainer_t6lock_tEEJPS2_S4_S6_RS7_S9_EEEE10run_actionEv + 106 at 0x105891faa ()
       11: 0   rethinkdb                           0x0000000105877c68 _ZN22threaded_context_ref_tC2Ev + 1304 at 0x105877c68 ()
error: Exiting.
Trace/BPT trap: 5

This happened when the dump reached 14%, and the same result occurred each of the 3 times I tried again.

I decided to manually increase the --cache-size to 4 GB and close any memory-intensive applications, which seemed to fix the problem: rethinkdb dump executed successfully. Interestingly, when I removed the --cache-size flag and tried dumping again, the error didn't occur, and it hasn't occurred since.
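
For reference, a rough sketch of this workaround when launching the server directly from the command line; the 4096 MB value and the data directory path are illustrative assumptions, not the exact command used:

    # Illustrative only: start the server with a ~4 GB page cache (value in MB)
    # instead of letting it pick a cache size automatically.
    rethinkdb --cache-size 4096 --directory /path/to/rethinkdb_data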

danielmewes commented 9 years ago

This is very concerning for obvious reasons, and we would like to find out how this happened as quickly as possible.

@soplakanets It would be extremely helpful if we could get a copy of your data files. We are happy to sign an NDA if there's any sensitive data in there.

I recommend taking the following steps (steps 2 to 5 are sketched as shell commands after the list):

  1. Find out the id of the affected table: `r.db(...).table(...).config()` shows you the table id in the `id` field of the result.
  2. Shut down the rethinkdb server.
  3. Copy the following two files from the RethinkDB data directory to a different location: the file `metadata` and the file that has the id of the affected table as its filename.
  4. Delete the file with the table id. This is so you can avoid further crashes when accessing the table. After restarting, the table file will be recreated as an empty table (since you ran a `table.delete()` when this happened, I assume this is fine?)
  5. Restart the server.
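
A rough sketch of steps 2 to 5 as shell commands, assuming a deb-package install with the packaged init script and default instance layout; the data directory, backup location, and table id below are placeholders, not values from this report:

    # Step 2: stop the server (assumes the packaged init script).
    sudo service rethinkdb stop

    DATA_DIR=/var/lib/rethinkdb/instances.d/default/data   # placeholder path
    TABLE_ID=paste-the-id-from-step-1-here                 # placeholder id

    # Step 3: keep copies of the metadata file and the table's data file.
    cp "$DATA_DIR/metadata" "$DATA_DIR/$TABLE_ID" /some/backup/location/

    # Step 4: remove the corrupted table file so queries stop crashing the server.
    rm "$DATA_DIR/$TABLE_ID"

    # Step 5: restart; the table file is recreated as an empty table.
    sudo service rethinkdb start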

I'll send you a separate email with information on how to best upload the data files to us.

danielmewes commented 9 years ago

@1lann Thanks for adding your report. Were you storing geospatial data in one of your tables?

soplakanets commented 9 years ago

> It would be extremely helpful if we could get a copy of your data files. We are happy to sign an NDA if there's any sensitive data in there.

No need for an NDA, this is auto-generated data which doesn't make sense without context. The problem is that I had to drop the table in order to have it working again. I will try to run the data generation script in the same order to reproduce it again.

> I recommend taking the following steps

As I said above, doing dropTable() helped recover from the issue. Thanks for your help.

1lann commented 9 years ago

@danielmewes No, I did not store any geospatial data whatsoever. I had a total of 3 tables, containing ~5k, ~100k, and ~2M documents respectively. The documents held very basic fields with numbers and strings. In the table with 5k documents, one of the fields contained an array with 10 elements. In the same 5k table, I used a compound secondary index consisting of 2 fields, and a few other basic secondary indexes.
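
For context, a compound secondary index like the one described here is created in ReQL by passing an array of fields to indexCreate; the table and field names below are made up for illustration, not the actual schema:

    // Hypothetical example only: a compound secondary index over two fields.
    r.table('events').indexCreate('user_and_time', [r.row('userId'), r.row('createdAt')])

    // It can later be queried with an array key, e.g.:
    r.table('events').getAll(['some-user-id', r.epochTime(1440000000)], {index: 'user_and_time'})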

soplakanets commented 9 years ago

@danielmewes Unfortunately, I have the same issue again. I will send you the data you requested via email.

Update: sent the data via hightail.

danielmewes commented 9 years ago

Thank you for the data @soplakanets! We'll investigate this asap.

danielmewes commented 9 years ago

Quick update: We've analyzed the data file and have derived a unit test that can reproduce a leaf node corruption. We're still investigating the details of the problem.

danielmewes commented 9 years ago

This turned out to be a really old bug in the btree code that could cause memory and/or data corruption in certain (usually rare) corner cases. It's quite embarrassing that we've had this serious bug around for such a long time. At the same time, I'm glad that we found it, thanks to your help @soplakanets.

It's possible that some fixes to our leaf node timestamp handling in RethinkDB 2.1 have made the problem surface more easily.

We will push out a bug fix release as quickly as possible.

danielmewes commented 9 years ago

A fix is implemented in branch daniel_4769, and is now in code review 3210 by @Tryneus .

danielmewes commented 9 years ago

Fixed in v2.1.x by commits 65572886e307231466813d80572872237b09a2db and 2f8aef3b0f7213364e59a5042b7761060d604548. Fixed in next by 5596db1 and c968442.

The fix will ship with RethinkDB 2.1.3.

danielmewes commented 9 years ago

Also cherry-picked the first fix into v2.0.x with 7bd1db8f6cb9e6b60d357654d17ba506ba57dc80, in case we release another 2.0.x point release at some point.

danielmewes commented 9 years ago

The likely fix for this shipped with RethinkDB 2.1.3. We strongly recommend that everyone upgrade as quickly as possible to avoid potential data corruption problems.
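
For the Ubuntu deb-package installs mentioned in this thread, the upgrade is roughly the following; this assumes the RethinkDB apt repository is already configured and the packaged init script is in use:

    # Sketch only: pull the new package from the already-configured repository,
    # then restart the configured instance(s) so the new binary is picked up.
    sudo apt-get update
    sudo apt-get install rethinkdb
    sudo service rethinkdb restart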

@soplakanets It's always a bit difficult with these hard-to-reproduce bugs to know for sure whether a given patch actually fixed the issue. I recommend upgrading to 2.1.3, but to still keep an eye on the issue to make sure it doesn't appear again (your workload seems to be particularly prone to exposing this bug).

soplakanets commented 9 years ago

@dalanmiller Thanks. Will upgrade asap.

Bacto commented 9 years ago

Hi guys,

First I would like to thank you for the amazing work you do on RethinkDB! I love it!

Unfortunately I have RethinkDB 2.1.3 and I have a similar bug.

I have a cluster of 2 servers, but the table I have a problem with is only on one server. When I try to delete all the data (~2M documents) in this table, the server crashes. There is no other access to this table at the time of the crash.

r.db('test').table('maxMindBlocksIPv4').delete()
2015-09-13T15:28:19.198159460 197.522587s error: Error in src/arch/runtime/thread_pool.cc at line 359:
2015-09-13T15:28:19.264949298 197.589377s error: Segmentation fault from reading the address 0x7f018563ff76.
2015-09-13T15:28:19.302892953 197.627320s error: Backtrace:
2015-09-13T15:28:20.187061917 198.511491s error: Sun Sep 13 15:28:19 2015\n\n1: backtrace_t::backtrace_t() at ??:?\n2: format_backtrace(bool) at ??:?\n3: report_fatal_error(char const*, int, char const*, ...) at ??:?\n4: linux_thread_pool_t::fatal_signal_handler(int, siginfo_t*, void*) at ??:?\n5: /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f02eea29340] at 0x7f02eea29340 (/lib/x86_64-linux-gnu/libpthread.so.0)\n6: /lib/x86_64-linux-gnu/libc.so.6(+0x1524d4) [0x7f02ee7a54d4] at 0x7f02ee7a54d4 (/lib/x86_64-linux-gnu/libc.so.6)\n7: leaf::garbage_collect(value_sizer_t*, leaf_node_t*, int, int*, boost::optional<int>) at ??:?\n8: leaf::move_elements(value_sizer_t*, leaf_node_t*, int, int, int, leaf_node_t*, int, int, std::vector<void const*, std::allocator<void const*> >*) at ??:?\n9: leaf::level(value_sizer_t*, int, leaf_node_t*, leaf_node_t*, btree_key_t*, std::vector<void const*, std::allocator<void const*> >*) at ??:?\n10: check_and_handle_underfull(value_sizer_t*, buf_lock_t*, buf_lock_t*, superblock_t*, btree_key_t const*, value_deleter_t const*) at ??:?\n11: apply_keyvalue_change(value_sizer_t*, keyvalue_location_t*, btree_key_t const*, repli_timestamp_t, value_deleter_t const*, key_modification_callback_t*, delete_mode_t) at ??:?\n12: kv_location_delete(keyvalue_location_t*, store_key_t const&, repli_timestamp_t, deletion_context_t const*, delete_mode_t, rdb_modification_info_t*) at ??:?\n13: rdb_replace_and_return_superblock(btree_loc_info_t const&, btree_point_replacer_t const*, deletion_context_t const*, promise_t<superblock_t*>*, rdb_modification_info_t*, profile::trace_t*) at ??:?\n14: do_a_replace_from_batched_replace(auto_drainer_t::lock_t, fifo_enforcer_sink_t*, fifo_enforcer_write_token_t const&, btree_loc_info_t const&, one_replace_t, ql::configured_limits_t const&, promise_t<superblock_t*>*, rdb_modification_report_cb_t*, bool, ql::datum_t*, profile::trace_t*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*) at ??:?\n15: std::_Function_handler<void (), std::_Bind<void (*(auto_drainer_t::lock_t, fifo_enforcer_sink_t*, fifo_enforcer_write_token_t, btree_loc_info_t, one_replace_t, ql::configured_limits_t, promise_t<superblock_t*>*, rdb_modification_report_cb_t*, bool, ql::datum_t*, profile::trace_t*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*))(auto_drainer_t::lock_t, fifo_enforcer_sink_t*, fifo_enforcer_write_token_t const&, btree_loc_info_t const&, one_replace_t, ql::configured_limits_t const&, promise_t<superblock_t*>*, rdb_modification_report_cb_t*, bool, ql::datum_t*, profile::trace_t*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)> >::_M_invoke(std::_Any_data const&) at ??:?\n16: coro_pool_t<std::function<void ()> >::worker_run(std::function<void ()>, auto_drainer_t::lock_t) at ??:?\n17: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (coro_pool_t<std::function<void ()> >::*)(std::function<void ()>, auto_drainer_t::lock_t)> (coro_pool_t<std::function<void ()> >*, std::function<void ()>, auto_drainer_t::lock_t)> >::run_action() at ??:?\n18: coro_t::run() at ??:?
2015-09-13T15:28:20.248535860 198.572963s error: Exiting.

Another example:

2015-09-13T15:50:39.321047629 1298.095524s error: Error in src/buffer_cache/alt.cc at line 718:
2015-09-13T15:50:39.360596908 1298.135073s error: Guarantee failed: [ret != repli_timestamp_t::invalid]
2015-09-13T15:50:39.398113181 1298.172589s error: Backtrace:
2015-09-13T15:50:40.017122232 1298.791599s error: Sun Sep 13 15:50:39 2015\n\n1: backtrace_t::backtrace_t() at ??:?\n2: format_backtrace(bool) at ??:?\n3: report_fatal_error(char const*, int, char const*, ...) at ??:?\n4: buf_lock_t::get_recency() const at ??:?\n5: btree_depth_first_traversal(counted_t<counted_buf_lock_and_read_t>, key_range_t const&, depth_first_traversal_callback_t*, access_t, direction_t, btree_key_t const*, btree_key_t const*, signal_t*) at ??:?\n6: btree_depth_first_traversal(counted_t<counted_buf_lock_and_read_t>, key_range_t const&, depth_first_traversal_callback_t*, access_t, direction_t, btree_key_t const*, btree_key_t const*, signal_t*) at ??:?\n7: btree_depth_first_traversal(counted_t<counted_buf_lock_and_read_t>, key_range_t const&, depth_first_traversal_callback_t*, access_t, direction_t, btree_key_t const*, btree_key_t const*, signal_t*) at ??:?\n8: btree_depth_first_traversal(counted_t<counted_buf_lock_and_read_t>, key_range_t const&, depth_first_traversal_callback_t*, access_t, direction_t, btree_key_t const*, btree_key_t const*, signal_t*) at ??:?\n9: btree_depth_first_traversal(superblock_t*, key_range_t const&, depth_first_traversal_callback_t*, access_t, direction_t, release_superblock_t, signal_t*) at ??:?\n10: btree_concurrent_traversal(superblock_t*, key_range_t const&, concurrent_traversal_callback_t*, direction_t, release_superblock_t) at ??:?\n11: rdb_rget_slice(btree_slice_t*, key_range_t const&, superblock_t*, ql::env_t*, ql::batchspec_t const&, std::vector<boost::variant<ql::map_wire_func_t, ql::group_wire_func_t, ql::filter_wire_func_t, ql::concatmap_wire_func_t, ql::distinct_wire_func_t, ql::zip_wire_func_t, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<ql::map_wire_func_t, ql::group_wire_func_t, ql::filter_wire_func_t, ql::concatmap_wire_func_t, ql::distinct_wire_func_t, ql::zip_wire_func_t, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, boost::optional<boost::variant<ql::count_wire_func_t, ql::sum_wire_func_t, ql::avg_wire_func_t, ql::min_wire_func_t, ql::max_wire_func_t, ql::reduce_wire_func_t, ql::limit_read_t, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > const&, sorting_t, rget_read_response_t*, release_superblock_t) at ??:?\n12: do_read(ql::env_t*, store_t*, btree_slice_t*, real_superblock_t*, rget_read_t const&, rget_read_response_t*, release_superblock_t) at ??:?\n13: 
rdb_read_visitor_t::operator()(rget_read_t const&) at ??:?\n14: store_t::protocol_read(read_t const&, read_response_t*, real_superblock_t*, signal_t*) at ??:?\n15: store_t::read(read_t const&, read_response_t*, read_token_t*, signal_t*) at ??:?\n16: replica_t::do_read(read_t const&, state_timestamp_t, signal_t*, read_response_t*) at ??:?\n17: primary_dispatcher_t::read(read_t const&, fifo_enforcer_sink_t::exit_read_t*, order_token_t, signal_t*, read_response_t*) at ??:?\n18: primary_execution_t::on_read(read_t const&, fifo_enforcer_sink_t::exit_read_t*, order_token_t, signal_t*, read_response_t*, admin_err_t*) at ??:?\n19: primary_query_server_t::client_t::perform_request(boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, signal_t*) at ??:?\n20: multi_client_server_t<boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, primary_query_server_t*, primary_query_server_t::client_t>::client_t::on_request(signal_t*, boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) at ??:?\n21: mailbox_t<void (boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>)>::read_impl_t::read(read_stream_t*, signal_t*) at ??:?\n22: /usr/bin/rethinkdb() 
[0x944a02] at 0x944a02 ()\n23: coro_t::run() at ??:?
2015-09-13T15:50:40.069366756 1298.843842s error: Exiting.

r.db('test').table('maxMindBlocksIPv4').config():

{
  "db": "test",
  "durability": "hard",
  "id": "5f369b90-ab69-4b8e-b2e1-a8d4209477c8",
  "indexes": ["toDelete"],
  "name": "maxMindBlocksIPv4",
  "primary_key": "id",
  "shards": [
    {
      "nonvoting_replicas": [],
      "primary_replica": "s1",
      "replicas": ["s1"]
    }
  ],
  "write_acks": "majority"
}

rethinkdb --version

rethinkdb 2.1.3~0trusty (GCC 4.8.2)

I run the server on Linux 3.16.3 (Ubuntu 14.04.1) and installed RethinkDB from the deb packages.

If you need more information or a copy of the table, just let me know.

Thanks, Adrien

danielmewes commented 9 years ago

Hi Adrien (@Bacto), thanks for the report. Did you have this table before you upgraded to 2.1.3? If that's the case, the crashes are probably consistent with the sort of corruption that this bug could have caused. The corruption would probably have happened before the upgrade, and only showed up when you tried to delete all documents after upgrading to 2.1.3.

Could you make a copy of the rethinkdb data directory on the server that's experiencing the crash, after you've shut down the rethinkdb process on that server? If possible, I'd like to take a look at the data file to confirm that this isn't a different bug. I can send you instructions on how to upload the data. Once you've made a copy, I recommend dropping the table and recreating it from scratch in order to avoid the crash in the future.

danielmewes commented 9 years ago

@Bacto One more question: Did the first crash leave a core file behind by any chance?

Bacto commented 9 years ago

Hi Daniel (@danielmewes),

Thank you for your fast reply.

I deleted the original table yesterday and recreated it (on RethinkDB 2.1.3). Unfortunately I still have the problem. I tried deleting data, and that seems to work:

r.db('test').table('maxMindBlocksIPv4').delete()

But it crashed with:

r.db('test').table('maxMindBlocksIPv4').update({ toDelete: true }).run({durability: 'soft'})
2015-09-15T09:52:37.580596923 151295.742160s error: Error in src/rdb_protocol/lazy_json.cc at line 22:
2015-09-15T09:52:37.620582133 151295.782145s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2015-09-15T09:52:37.646836021 151295.808399s error: Backtrace:
2015-09-15T09:52:38.551113880 151296.712678s error: Tue Sep 15 09:52:37 2015\n\n1: backtrace_t::backtrace_t() at ??:?\n2: format_backtrace(bool) at ??:?\n3: report_fatal_error(char const*, int, char const*, ...) at ??:?\n4: get_data(rdb_value_t const*, buf_parent_t) at ??:?\n5: lazy_json_t::get() const at ??:?\n6: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at ??:?\n7: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at ??:?\n8: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at ??:?\n9: coro_t::run() at ??:?

I don't have a core file, but I can try to get one if you explain how to generate it. I can send you a copy of the table too, just tell me how :)

danielmewes commented 9 years ago

Thanks, I'm going to send you an email in a moment with instructions on how to upload the data file.

It also sounds like you're able to reproduce the problem on a fresh table relatively quickly (given that you started with a new table just yesterday). What types of write operations other than the full-table delete() do you run on the maxMindBlocksIPv4 table?

Re-opening this, since there still seems to be an issue.

danielmewes commented 9 years ago

@Bacto actually I don't have your email address. Would you mind dropping me a quick message to daniel@rethinkdb.com ? I will reply with upload instructions.

Bacto commented 9 years ago

Thanks @danielmewes, I've just sent you an email.

You're lucky, I don't do a lot of operations on this table.

The script does this:

Flag all entries as "toDelete":

r.table('maxMindBlocksIPv4')
    .update({ toDelete: true })
    .run({durability: 'soft'});

Insert new entries in groups of 100,000, ending up with a total of ~2.5M entries:

r.table('maxMindBlocksIPv4')
    .insert(toInsert)
    .run({durability: 'soft'});

Delete all entries that were flagged as "toDelete":

r.table('maxMindBlocksIPv4')
    .getAll(true, { index: 'toDelete' })
    .delete()
    .run({durability: 'soft'});

For information, I use rethinkdbdash on Node.js. I have only one secondary index, on "toDelete".
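
Putting the three steps together, a minimal sketch of this workload with rethinkdbdash; the connection options, the batching helper, and the document contents are assumptions for illustration only:

    // Sketch of the workload above using rethinkdbdash (connection details and
    // document contents are placeholders, not the real MaxMind data).
    const r = require('rethinkdbdash')({ host: 'localhost', port: 28015, db: 'test' });

    async function refreshBlocks(batches) {
      // 1. Flag every existing entry for later deletion.
      await r.table('maxMindBlocksIPv4')
        .update({ toDelete: true })
        .run({ durability: 'soft' });

      // 2. Insert the new entries in groups of ~100,000 documents.
      for (const batch of batches) {
        await r.table('maxMindBlocksIPv4')
          .insert(batch)
          .run({ durability: 'soft' });
      }

      // 3. Delete everything still flagged, via the "toDelete" secondary index.
      await r.table('maxMindBlocksIPv4')
        .getAll(true, { index: 'toDelete' })
        .delete()
        .run({ durability: 'soft' });
    }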

danielmewes commented 9 years ago

Thanks for the description of your workload @Bacto . How large are the documents approximately? Do you have an example document that you can share?

danielmewes commented 9 years ago

Ok I've been able to reproduce this. So there's definitely a second issue in the btree leaf node code. Have to investigate more to find the exact cause of the corruption.

danielmewes commented 9 years ago

Update: Testing a fix now. We should be able to get a bug fix release out in the next few days.

danielmewes commented 9 years ago

The fix is now in v2.1.x and next. It will ship with RethinkDB 2.1.4. The internal code review number was 3226.

danielmewes commented 9 years ago

@Bacto We just pushed RethinkDB 2.1.4 to the repositories. Please let us know in case you still encounter any issues after the update.

Bacto commented 9 years ago

Hi @danielmewes, I have just upgraded to 2.1.4 and I still have the issue :(

2015-09-17T09:34:21.922830350 320.014454s error: Error in src/rdb_protocol/lazy_json.cc at line 22:
2015-09-17T09:34:22.028834984 320.120459s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2015-09-17T09:34:22.067034201 320.158659s error: Backtrace:
2015-09-17T09:34:22.367373396 320.458999s error: Thu Sep 17 09:34:22 2015\n\n1: backtrace_t::backtrace_t() at ??:?\n2: format_backtrace(bool) at ??:?\n3: report_fatal_error(char const*, int, char const*, ...) at ??:?\n4: get_data(rdb_value_t const*, buf_parent_t) at ??:?\n5: lazy_json_t::get() const at ??:?\n6: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at ??:?\n7: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at ??:?\n8: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at ??:?\n9: coro_t::run() at ??:?

I'll send you the table in a few minutes.

Bacto commented 9 years ago

I've sent you the table. For information, it also crashed when I tried to dump the table with:

rethinkdb dump -e test.maxMindBlocksIPv4
2015-09-17T09:38:34.331210534 213.217966s error: Error in src/rdb_protocol/lazy_json.cc at line 22:
2015-09-17T09:38:34.362081851 213.248837s error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of rdb value failed with error archive_result_t::RANGE_ERROR.
2015-09-17T09:38:34.389654434 213.276410s error: Backtrace:
2015-09-17T09:38:34.764616226 213.651373s error: Thu Sep 17 09:38:34 2015\n\n1: backtrace_t::backtrace_t() at ??:?\n2: format_backtrace(bool) at ??:?\n3: report_fatal_error(char const*, int, char const*, ...) at ??:?\n4: get_data(rdb_value_t const*, buf_parent_t) at ??:?\n5: lazy_json_t::get() const at ??:?\n6: rget_cb_t::handle_pair(scoped_key_value_t&&, concurrent_traversal_fifo_enforcer_signal_t) at ??:?\n7: concurrent_traversal_adapter_t::handle_pair_coro(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t) at ??:?\n8: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (concurrent_traversal_adapter_t::*)(scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> (concurrent_traversal_adapter_t*, scoped_key_value_t*, semaphore_acq_t*, fifo_enforcer_write_token_t, auto_drainer_t::lock_t)> >::run_action() at ??:?\n9: coro_t::run() at ??:?

danielmewes commented 9 years ago

@Bacto Is that with the existing table? Unfortunately the patch usually can't revert corruption that has already happened and found its way to disk. It appears that this is the case here, so you'll have to recreate the table again.

With your data file, we can attempt to recover most of the data (I suspect that only a small number of documents will be corrupted). Please let me know if you had important data in the table that you would like to have recovered, and I'll see to it that you get the intact data by tomorrow.

My apologies again for all the inconvenience.

Bacto commented 9 years ago

Oh, OK, I thought that the patch would repair the table.

I don't care about the table data. I will destroy the table, recreate it, and tell you if it's working now :)

oldlam commented 9 years ago

I'm also facing the segmentation fault here, with both 2.1.1 and 2.1.4 on OS X and fresh data. To reproduce this, create a table "users", then:

r.table('users').indexCreate('time', r.row('createdAt'))
r.table('users').insert({name: 'aaa', createdAt: r.now()})
r.table('users').insert({name: 'bbb', createdAt: r.now()})

r.table('users').orderBy({index: r.desc('time')}).limit(2).changes()

r.table('users').filter({name: 'bbb'}).update({name: 'zzz'}) // this will produce the crash

here is the error message I got after the crash:

error: Error in src/arch/runtime/thread_pool.cc at line 359:
error: Segmentation fault from reading the address 0x8.
error: Backtrace:
error: Thu Sep 17 19:27:47 2015
error: Exiting.

EDIT: I just downloaded 2.0.5 and did the above steps again; it works without a problem.

Thanks!

Bacto commented 9 years ago

@danielmewes I no longer have the problem after destroying and recreating the table. Thanks a lot! :-)

danielmewes commented 9 years ago

@oldlam Thanks for mentioning that and for the steps to reproduce. On Linux I'm not getting the seg fault with those queries, but the changefeed isn't properly working either. It looks like there's a bug that yields different types of failures on different systems.

I think this is a separate bug though. I copied your report over to https://github.com/rethinkdb/rethinkdb/issues/4850