Closed twmb closed 2 years ago
This is a real bug. One of the redpanda nodes (docker_n_33) crashed:
TRACE 2021-10-12 18:01:14,081 [shard 2] cluster - rm_partition_frontend.cc:238 - processing name:begin_tx, ntp:{kafka/tx-topic/0}, pid:{producer_identity: id=1, epoch=0}, tx_seq:1
WARN 2021-10-12 18:01:14,081 [shard 1] cluster - rm_partition_frontend.cc:282 - rm_stm::begin_tx({kafka/tx-topic/0},...) failed with cluster::tx_errc:1
*** stack smashing detected ***: terminated
This crash happens on the very first produce to the newly created topic, so it is probably not related to https://github.com/vectorizedio/redpanda/issues/2602 (the issue for which this test was added).
I think I found a potential root cause. Around rm_partition_frontend.cc:282 we invoke tx_helpers::sleep_abortable, which in turn calls ss::sleep_abortable. The latter generates noise in the log but doesn't propagate an exception to the upper layer:
TRACE 2021-10-13 17:31:02,269 [shard 2] exception - Throw exception at:
0x34380f4 0x311187a 0x2a1780e0bda7 0xff6656 0x137f977 0x31b2866 0x3237ece 0x323986b 0x31b5a39 0x320551d 0x31631ff /lib/x86_64-linux-gnu/libpthread.so.0+0x944f /lib/x86_64-linux-gnu/libc.so.6+0x117d52
TRACE 2021-10-13 17:31:02,269 [shard 2] exception - Throw exception at:
0x34380f4 0x311187a 0x2a1780e0c1d2 /home/denis/vectorized/redpanda/vbuild/release/clang/rp_deps_install/lib/libc++.so.1+0x47e38 0x31266cf 0x31b2214 0x31b56d7 0x320551d 0x31631ff /lib/x86_64-linux-gnu/libpthread.so.0+0x944f /lib/x86_64-linux-gnu/libc.so.6+0x117d52
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception<seastar::future<void> seastar::sleep_abortable<std::__1::chrono::steady_clock>(std::__1::chrono::steady_clock::duration)::'lambda'(std::exception_ptr)>(std::__1::chrono::steady_clock&&)::'lambda'(std::__1::chrono::steady_clock&&), seastar::futurize<std::__1::chrono::steady_clock>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception<seastar::future<void> seastar::sleep_abortable<std::__1::chrono::steady_clock>(std::__1::chrono::steady_clock::duration)::'lambda'(std::exception_ptr)>(std::__1::chrono::steady_clock&&)::'lambda'(std::__1::chrono::steady_clock&&)>(seastar::future<void> seastar::future<void>::handle_exception<seastar::future<void> seastar::sleep_abortable<std::__1::chrono::steady_clock>(std::__1::chrono::steady_clock::duration)::'lambda'(std::exception_ptr)>(std::__1::chrono::steady_clock&&)::'lambda'(std::__1::chrono::steady_clock&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception<seastar::future<void> seastar::sleep_abortable<std::__1::chrono::steady_clock>(std::__1::chrono::steady_clock::duration)::'lambda'(std::exception_ptr)>(std::__1::chrono::steady_clock&&)::'lambda'(std::__1::chrono::steady_clock&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
--------
seastar::internal::coroutine_traits_base<bool>::promise_type
--------
seastar::internal::coroutine_traits_base<cluster::begin_tx_reply>::promise_type
--------
seastar::continuation<seastar::internal::promise_base_with_type<std::__1::vector<cluster::begin_tx_reply, std::__1::allocator<cluster::begin_tx_reply> > >, seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >::future_type seastar::internal::complete_when_all<seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >, seastar::future<cluster::begin_tx_reply> >(std::__1::vector<seastar::future<cluster::begin_tx_reply>, std::__1::allocator<seastar::future<cluster::begin_tx_reply> > >&&, std::__1::vector<seastar::future<cluster::begin_tx_reply>, std::__1::allocator<seastar::future<cluster::begin_tx_reply> > >::iterator)::'lambda'(seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >), seastar::futurize<seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> > >::type seastar::future<cluster::begin_tx_reply>::then_wrapped_nrvo<seastar::future<std::__1::vector<cluster::begin_tx_reply, std::__1::allocator<cluster::begin_tx_reply> > >, seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >::future_type seastar::internal::complete_when_all<seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >, seastar::future<cluster::begin_tx_reply> >(std::__1::vector<seastar::future<cluster::begin_tx_reply>, std::__1::allocator<seastar::future<cluster::begin_tx_reply> > >&&, std::__1::vector<seastar::future<cluster::begin_tx_reply>, std::__1::allocator<seastar::future<cluster::begin_tx_reply> > >::iterator)::'lambda'(seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >)>(seastar::future<cluster::begin_tx_reply>&&)::'lambda'(seastar::internal::promise_base_with_type<std::__1::vector<cluster::begin_tx_reply, std::__1::allocator<cluster::begin_tx_reply> > >&&, 
seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >::future_type seastar::internal::complete_when_all<seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >, seastar::future<cluster::begin_tx_reply> >(std::__1::vector<seastar::future<cluster::begin_tx_reply>, std::__1::allocator<seastar::future<cluster::begin_tx_reply> > >&&, std::__1::vector<seastar::future<cluster::begin_tx_reply>, std::__1::allocator<seastar::future<cluster::begin_tx_reply> > >::iterator)::'lambda'(seastar::internal::extract_values_from_futures_vector<seastar::future<cluster::begin_tx_reply> >)&, seastar::future_state<cluster::begin_tx_reply>&&), cluster::begin_tx_reply>
--------
seastar::continuation<seastar::internal::promise_base_with_type<cluster::add_paritions_tx_reply>, cluster::tx_gateway_frontend::do_add_partition_to_tx(cluster::tm_transaction, seastar::shared_ptr<cluster::tm_stm>, cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_7, seastar::future<cluster::add_paritions_tx_reply> seastar::future<std::__1::vector<cluster::begin_tx_reply, std::__1::allocator<cluster::begin_tx_reply> > >::then_impl_nrvo<cluster::tx_gateway_frontend::do_add_partition_to_tx(cluster::tm_transaction, seastar::shared_ptr<cluster::tm_stm>, cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_7, seastar::future<cluster::add_paritions_tx_reply> >(cluster::tx_gateway_frontend::do_add_partition_to_tx(cluster::tm_transaction, seastar::shared_ptr<cluster::tm_stm>, cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_7&&)::'lambda'(seastar::internal::promise_base_with_type<cluster::add_paritions_tx_reply>&&, cluster::tx_gateway_frontend::do_add_partition_to_tx(cluster::tm_transaction, seastar::shared_ptr<cluster::tm_stm>, cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_7&, seastar::future_state<std::__1::vector<cluster::begin_tx_reply, std::__1::allocator<cluster::begin_tx_reply> > >&&), std::__1::vector<cluster::begin_tx_reply, std::__1::allocator<cluster::begin_tx_reply> > >
--------
seastar::internal::coroutine_traits_base<cluster::add_paritions_tx_reply>::promise_type
--------
seastar::continuation<seastar::internal::promise_base_with_type<cluster::add_paritions_tx_reply>, seastar::future<cluster::add_paritions_tx_reply>::finally_body<auto seastar::futurize<std::__1::result_of<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'() ()>::type>::type seastar::with_semaphore<seastar::semaphore_default_exception_factory, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'(), std::__1::chrono::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>&, unsigned long, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'()&&)::'lambda'(seastar::semaphore_default_exception_factory)::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock> 
>(seastar::semaphore_default_exception_factory)::'lambda'(), false>, seastar::futurize<seastar::semaphore_default_exception_factory>::type seastar::future<cluster::add_paritions_tx_reply>::then_wrapped_nrvo<seastar::future<cluster::add_paritions_tx_reply>, seastar::future<cluster::add_paritions_tx_reply>::finally_body<auto seastar::futurize<std::__1::result_of<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'() ()>::type>::type seastar::with_semaphore<seastar::semaphore_default_exception_factory, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'(), std::__1::chrono::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>&, unsigned long, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) 
const::'lambda'()&&)::'lambda'(seastar::semaphore_default_exception_factory)::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock> >(seastar::semaphore_default_exception_factory)::'lambda'(), false> >(cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'()&&)::'lambda'(seastar::internal::promise_base_with_type<cluster::add_paritions_tx_reply>&&, seastar::future<cluster::add_paritions_tx_reply>::finally_body<auto seastar::futurize<std::__1::result_of<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'() ()>::type>::type seastar::with_semaphore<seastar::semaphore_default_exception_factory, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'(), 
std::__1::chrono::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>&, unsigned long, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda'()&&)::'lambda'(seastar::semaphore_default_exception_factory)::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock> >(seastar::semaphore_default_exception_factory)::'lambda'(), false>&, seastar::future_state<cluster::add_paritions_tx_reply>&&), cluster::add_paritions_tx_reply>
--------
seastar::continuation<seastar::internal::promise_base_with_type<cluster::add_paritions_tx_reply>, seastar::future<cluster::add_paritions_tx_reply>::finally_body<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda0'(), false>, seastar::futurize<seastar::future<cluster::add_paritions_tx_reply> >::type seastar::future<cluster::add_paritions_tx_reply>::then_wrapped_nrvo<seastar::future<cluster::add_paritions_tx_reply>, seastar::future<cluster::add_paritions_tx_reply>::finally_body<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda0'(), false> >(seastar::future<cluster::add_paritions_tx_reply>::finally_body<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda0'(), false>&&)::'lambda'(seastar::internal::promise_base_with_type<cluster::add_paritions_tx_reply>&&, 
seastar::future<cluster::add_paritions_tx_reply>::finally_body<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5::operator()(cluster::tx_gateway_frontend&) const::'lambda'(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>)::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) const::'lambda0'(), false>&, seastar::future_state<cluster::add_paritions_tx_reply>&&), cluster::add_paritions_tx_reply>
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::smp_message_queue::async_work_item<seastar::future<cluster::add_paritions_tx_reply> seastar::sharded<cluster::tx_gateway_frontend>::invoke_on<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5, seastar::future<cluster::add_paritions_tx_reply> >(unsigned int, seastar::smp_submit_to_options, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5&&)::'lambda'()>::run_and_dispose()::'lambda'(cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5), seastar::futurize<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5>::type seastar::future<cluster::add_paritions_tx_reply>::then_wrapped_nrvo<void, seastar::smp_message_queue::async_work_item<seastar::future<cluster::add_paritions_tx_reply> seastar::sharded<cluster::tx_gateway_frontend>::invoke_on<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5, seastar::future<cluster::add_paritions_tx_reply> >(unsigned int, seastar::smp_submit_to_options, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5&&)::'lambda'()>::run_and_dispose()::'lambda'(cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5)>(seastar::smp_message_queue::async_work_item<seastar::future<cluster::add_paritions_tx_reply> 
seastar::sharded<cluster::tx_gateway_frontend>::invoke_on<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5, seastar::future<cluster::add_paritions_tx_reply> >(unsigned int, seastar::smp_submit_to_options, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5&&)::'lambda'()>::run_and_dispose()::'lambda'(cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::smp_message_queue::async_work_item<seastar::future<cluster::add_paritions_tx_reply> seastar::sharded<cluster::tx_gateway_frontend>::invoke_on<cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5, seastar::future<cluster::add_paritions_tx_reply> >(unsigned int, seastar::smp_submit_to_options, cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5&&)::'lambda'()>::run_and_dispose()::'lambda'(cluster::tx_gateway_frontend::add_partition_to_tx(cluster::add_paritions_tx_request, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)::$_5)&, seastar::future_state<cluster::add_paritions_tx_reply>&&), cluster::add_paritions_tx_reply>
However, when we pass an abort_source to ss::sleep_abortable, the noise disappears. So my hunch is that this internal Seastar error causes *** stack smashing detected ***: terminated, and to fix it we should use the ss::sleep_abortable overload that takes an abort_source.
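To make the idea of an explicitly owned abort source concrete, here is a minimal sketch in standard C++ rather than Seastar. It is an analogy only: demo_abort_source and demo_sleep_abortable are hypothetical names, and blocking on a condition variable stands in for Seastar's future-based sleep. What it shows is the shape of the call site, where the caller owns the abort source and observes an abort as an explicit result.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for an abort source; this is NOT seastar::abort_source.
class demo_abort_source {
public:
    // Marks the source as aborted and wakes every sleeper waiting on it.
    void request_abort() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _aborted = true;
        }
        _cv.notify_all();
    }

    // Blocks for up to `dur`. Returns true if the full duration elapsed,
    // false if an abort was requested before or during the wait.
    bool sleep_for(std::chrono::milliseconds dur) {
        std::unique_lock<std::mutex> lock(_mutex);
        return !_cv.wait_for(lock, dur, [this] { return _aborted; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _aborted = false;
};

enum class sleep_result { completed, aborted };

// Hypothetical helper: sleeps against a caller-owned abort source and
// reports the outcome to the caller as a value.
sleep_result demo_sleep_abortable(std::chrono::milliseconds dur,
                                  demo_abort_source& as) {
    return as.sleep_for(dur) ? sleep_result::completed : sleep_result::aborted;
}
```

A retry loop built on this would check the returned sleep_result after every backoff and stop retrying as soon as it sees aborted, so shutdown is handled at the call site.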
https://buildkite.com/vectorized/redpanda/builds/3308#e1b94ef1-9168-4026-8ff2-20926750cded (ran without 2647, though)
PR #2647 isn't conclusively behind this, but is a good candidate. Reopen if the failure reoccurs.
This failed again on the run after merging this PR, so it looks like something else is wrong too: https://buildkite.com/vectorized/redpanda/builds/3337#d002ed10-39da-41a9-bcf2-7b6066603e41
The test fails because a transaction coordinator's topic has replication factor 1, and when the node hosting it crashes, redpanda can't process transactional requests anymore. Why the node crashes is unknown. I've created an issue to track that problem.
I've reproduced the error on an arm64 node and checked that increasing the replication factor of the transactional topics makes TxFeatureFlagTest.test_disabling_transactions_after_they_being_used pass even with a crashing node. Sending a PR unblocking the test.
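The effect of the replication factor on availability can be sketched with majority arithmetic. This assumes the transactional topics are replicated with a Raft-style majority quorum. The function names are hypothetical and are not taken from the redpanda code base.

```cpp
#include <cassert>
#include <cstddef>

// Raft-style majority: strictly more than half of the replicas must be alive
// for the partition to keep serving requests.
bool has_quorum(std::size_t replication_factor, std::size_t alive_replicas) {
    return alive_replicas > replication_factor / 2;
}

// Number of replica crashes a partition can survive while keeping a majority.
std::size_t tolerated_crashes(std::size_t replication_factor) {
    if (replication_factor == 0) {
        return 0; // no replicas at all, nothing can be tolerated
    }
    return (replication_factor - 1) / 2;
}
```

With replication factor 1, tolerated_crashes returns 0, so losing the single hosting node makes the coordinator's partition unavailable. With replication factor 3 it returns 1, which matches the test passing despite one crashing node.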
https://buildkite.com/vectorized/redpanda/builds/3237#cc8deec5-9ad8-4eef-86af-c440ae06f576