rethinkdb / rethinkdb

The open-source database for the realtime web.
https://rethinkdb.com

error when running rebalance #5576

Closed: haraldschilly closed this issue 8 years ago

haraldschilly commented 8 years ago

2 out of 6 nodes crashed while rebalancing. Minutes later everything was happy again, but that's still bad. I found one stacktrace that might be helpful...

The version is 2.5.5.

```
1 [0xad99da]: backtrace_t::backtrace_t() at ??:?
2 [0xad9eba]: format_backtrace[abi:cxx11](bool) at ??:?
3 [0xd66626]: report_fatal_error(char const*, int, char const*, ...) at ??:?
4 [0x99298e]: store_t::get_or_make_changefeed_server(hash_region_t<key_range_t> const&) at ??:?
5 [0x997cb8]: rdb_read_visitor_t::operator()(changefeed_subscribe_t const&) at ??:?
6 [0x99b32f]: boost::detail::variant::invoke_visitor<rdb_read_visitor_t>::result_type boost::detail::variant::visitation_impl<mpl_::int_<0>, boost::detail::variant::visitation_impl_step<boost::mpl::l_iter<boost::mpl::l_item<mpl_::long_<10l>, point_read_t, boost::mpl::l_item<mpl_::long_<9l>, rget_read_t, boost::mpl::l_item<mpl_::long_<8l>, intersecting_geo_read_t, boost::mpl::l_item<mpl_::long_<7l>, nearest_geo_read_t, boost::mpl::l_item<mpl_::long_<6l>, changefeed_subscribe_t, boost::mpl::l_item<mpl_::long_<5l>, changefeed_stamp_t, boost::mpl::l_item<mpl_::long_<4l>, changefeed_limit_subscribe_t, boost::mpl::l_item<mpl_::long_<3l>, changefeed_point_stamp_t, boost::mpl::l_item<mpl_::long_<2l>, distribution_read_t, boost::mpl::l_item<mpl_::long_<1l>, dummy_read_t, boost::mpl::l_end> > > > > > > > > > >, boost::mpl::l_iter<boost::mpl::l_end> >, boost::detail::variant::invoke_visitor<rdb_read_visitor_t>, void const*, boost::variant<point_read_t<rget_read_t, intersecting_geo_read_t, nearest_geo_read_t, changefeed_subscribe_t, changefeed_stamp_t, changefeed_limit_subscribe_t, changefeed_point_stamp_t, distribution_read_t, dummy_read_t> >::has_fallback_type_>(int, int, boost::variant<point_read_t<rget_read_t, intersecting_geo_read_t, nearest_geo_read_t, changefeed_subscribe_t, changefeed_stamp_t, changefeed_limit_subscribe_t, changefeed_point_stamp_t, distribution_read_t, dummy_read_t> >::has_fallback_type_&, void const*, mpl_::bool_<false>, boost::variant<point_read_t<rget_read_t, intersecting_geo_read_t, nearest_geo_read_t, changefeed_subscribe_t, changefeed_stamp_t, changefeed_limit_subscribe_t, changefeed_point_stamp_t, distribution_read_t, dummy_read_t> >::has_fallback_type_, mpl_::int_<0>*, boost::detail::variant::visitation_impl_step<boost::mpl::l_iter<boost::mpl::l_item<mpl_::long_<10l>, point_read_t, boost::mpl::l_item<mpl_::long_<9l>, rget_read_t, boost::mpl::l_item<mpl_::long_<8l>, intersecting_geo_read_t, boost::mpl::l_item<mpl_::long_<7l>, nearest_geo_read_t, boost::mpl::l_item<mpl_::long_<6l>, changefeed_subscribe_t, boost::mpl::l_item<mpl_::long_<5l>, changefeed_stamp_t, boost::mpl::l_item<mpl_::long_<4l>, changefeed_limit_subscribe_t, boost::mpl::l_item<mpl_::long_<3l>, changefeed_point_stamp_t, boost::mpl::l_item<mpl_::long_<2l>, distribution_read_t, boost::mpl::l_item<mpl_::long_<1l>, dummy_read_t, boost::mpl::l_end> > > > > > > > > > >, boost::mpl::l_iter<boost::mpl::l_end> >*) at ??:?
7 [0x993947]: store_t::protocol_read(read_t const&, read_response_t*, real_superblock_t*, signal_t*) at ??:?
8 [0x81b468]: store_t::read(read_t const&, read_response_t*, read_token_t*, signal_t*) at ??:?
9 [0xb873d6]: replica_t::do_read(read_t const&, state_timestamp_t, signal_t*, read_response_t*) at ??:?
10 [0xb7bbd4]: primary_dispatcher_t::read(read_t const&, fifo_enforcer_sink_t::exit_read_t*, order_token_t, signal_t*, read_response_t*) at ??:?
11 [0xd28a41]: primary_execution_t::on_read(read_t const&, fifo_enforcer_sink_t::exit_read_t*, order_token_t, signal_t*, read_response_t*, admin_err_t*) at ??:?
12 [0xba282d]: primary_query_server_t::client_t::perform_request(boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t> const&, signal_t*) at ??:?
13 [0xba9256]: multi_client_server_t<boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t>, primary_query_server_t*, primary_query_server_t::client_t>::client_t::on_request(signal_t*, boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t> const&) at ??:?
14 [0xba8d79]: mailbox_t<void (boost::variant<primary_query_bcard_t::read_request_t, primary_query_bcard_t::write_request_t>)>::read_impl_t::read(read_stream_t*, signal_t*) at ??:?
15 [0xa07732]: /usr/bin/rethinkdb() [0xa07732] at 0xa07732 ()
16 [0x9e7102]: coro_t::run() at ??:?
```
danielmewes commented 8 years ago

Thanks for the report @haraldschilly . I thought we had this under control, but apparently not. :-(

Is there an error message above the backtrace in the log?

danielmewes commented 8 years ago

(pinging @mlucy )

haraldschilly commented 8 years ago

Oh sorry, I missed that there is a part above! That does indeed look helpful:

```
2016-03-30T08:42:13.182277319 2773091.448582s info: Table c99153c0-432d-40f0-8626-f0a24520dc72: Configuration is changing.
2016-03-30T08:45:57.948470411 2773316.214775s info: Table 05c83381-98cf-4c3d-af1b-1cbe1a3e4b69: Configuration is changing.
2016-03-30T08:46:00.494933067 2773318.761238s info: Table 4d50504e-a067-4979-ae80-3ca335f489b0: Configuration is changing.
2016-03-30T08:46:14.222710664 2773332.489015s info: Table 05c83381-98cf-4c3d-af1b-1cbe1a3e4b69: Configuration is changing.
2016-03-30T08:47:29.684079835 2773407.950385s error: Error in src/rdb_protocol/store.cc at line 960:
2016-03-30T08:47:29.685241407 2773407.951546s error: Guarantee failed: [!pair.first.inner.overlaps(region.inner)]
2016-03-30T08:47:29.685273472 2773407.951578s error: Backtrace:
2016-03-30T08:47:29.686709725 2773407.953014s error: Error in src/rdb_protocol/store.cc at line 960:
2016-03-30T08:47:29.686730414 2773407.953035s error: Guarantee failed: [!pair.first.inner.overlaps(region.inner)]
2016-03-30T08:47:29.686746154 2773407.953050s error: Backtrace:
```
danielmewes commented 8 years ago

Thanks, that error should indeed be helpful.

mlucy commented 8 years ago

A potential fix for this is in `next` (OTS by @danielmewes). We think the bug was caused by calling `note_reshard` in the destructor of `local_replicator_t`: the call previously happened before that object's auto-drainer was destroyed, so `changefeed_subscribe_t`s on the old region could still be queued up and wind up running after `note_reshard`.

danielmewes commented 8 years ago

I'm going to close this for bookkeeping reasons. If we see the issue again after RethinkDB 2.3.0, we will reopen it.