Closed — kargh closed this issue 6 months ago
I'm going to prune the log messages a bit so that I'm not pasting hundreds of identical lines. I assume this is happening while other nodes are crashing/restarting.
Nov 29 17:31:30 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:30,877 [shard 11] rpc - transport.cc:172 - Request timeout to {host: 10.16.67.46, port: 33145}, correlation id: 9060 (1 in flight)
Nov 29 17:31:35 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:35,010 [shard 20] rpc - transport.cc:172 - Request timeout to {host: 10.16.67.45, port: 33145}, correlation id: 9100 (1 in flight)
Nov 29 17:31:35 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:35,085 [shard 37] rpc - Disconnected from server {host: 10.16.67.45, port: 33145}: std::__1::system_error (error system:104, read: Connection reset by peer)
Nov 29 17:31:35 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:35,101 [shard 41] rpc - Disconnected from server {host: 10.16.67.45, port: 33145}: std::__1::system_error (error system:104, read: Connection reset by peer)
Nov 29 17:31:35 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:35,118 [shard 20] rpc - Disconnected from server {host: 10.16.67.45, port: 33145}: std::__1::system_error (error system:104, read: Connection reset by peer)
Nov 29 17:31:36 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:36,221 [shard 11] rpc - transport.cc:172 - Request timeout to {host: 10.16.67.46, port: 33145}, correlation id: 9111 (1 in flight)
Nov 29 17:31:37 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:37,551 [shard 23] raft - [group_id:14529, {kafka/summarystats-dspstat/7}] vote_stm.cc:264 - becoming the leader term:14
Nov 29 17:31:37 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:37,552 [shard 23] storage - segment.cc:655 - Creating new segment /var/lib/redpanda/data/kafka/summarystats-dspstat/7_798925/513-14-v1.log
Nov 29 17:31:37 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:37,651 [shard 23] raft - [group_id:14529, {kafka/summarystats-dspstat/7}] vote_stm.cc:280 - became the leader term: 14
Nov 29 17:31:37 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:37,672 [shard 51] raft - [group_id:17138, {kafka/rxdata-videoevent/24}] vote_stm.cc:77 - vote reply from {id: {9}, revision: {799057}} - {term:{14}, target_node_id{id: {0}, revision: {799057}}, vote_granted: 1, log_ok:1}
Nov 29 17:31:37 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:37,883 [shard 62] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:37 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:37,883 [shard 5] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:37 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:37,883 [shard 20] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: WARN 2022-11-29 17:31:38,452 [shard 0] cluster - metadata_dissemination_service.cc:474 - Error sending metadata update rpc::errc::exponential_backoff to 8
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: WARN 2022-11-29 17:31:38,452 [shard 0] cluster - metadata_dissemination_service.cc:474 - Error sending metadata update rpc::errc::exponential_backoff to 6
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: WARN 2022-11-29 17:31:38,452 [shard 0] cluster - metadata_dissemination_service.cc:474 - Error sending metadata update rpc::errc::unknown to 7
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:38,453 [shard 61] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:38,453 [shard 23] raft - [group_id:10667, {kafka/rxdata-sspbidresponse/17}] consensus.cc:197 - [heartbeats_majority] Stepping down as leader in term 11, dirty offset 10272783
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:38,461 [shard 42] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:38,478 [shard 51] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:38 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:38,479 [shard 0] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,583 [shard 8] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,583 [shard 8] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 6
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,584 [shard 32] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,584 [shard 32] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 8
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,587 [shard 18] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,587 [shard 18] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 6
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,591 [shard 16] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 8
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: WARN 2022-11-29 17:31:41,591 [shard 61] r/heartbeat - heartbeat_manager.cc:384 - Error dispatching heartbeats - seastar::named_semaphore_timed_out (Semaphore timed out: raft/connected)
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: WARN 2022-11-29 17:31:41,594 [shard 56] r/heartbeat - heartbeat_manager.cc:384 - Error dispatching heartbeats - seastar::named_semaphore_timed_out (Semaphore timed out: raft/connected)
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: WARN 2022-11-29 17:31:41,594 [shard 27] r/heartbeat - heartbeat_manager.cc:384 - Error dispatching heartbeats - seastar::named_semaphore_timed_out (Semaphore timed out: raft/connected)
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: WARN 2022-11-29 17:31:41,595 [shard 34] r/heartbeat - heartbeat_manager.cc:384 - Error dispatching heartbeats - seastar::named_semaphore_timed_out (Semaphore timed out: raft/connected)
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,597 [shard 28] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,601 [shard 53] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 6
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,601 [shard 53] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 8
Nov 29 17:31:41 srv-01-03-404.iad1.trmr.io rpk[38137]: INFO 2022-11-29 17:31:41,601 [shard 53] r/heartbeat - heartbeat_manager.cc:209 - Closed unresponsive connection to 7
This is from Node 1; the previous messages were from Node 0.
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: WARN 2022-11-29 17:33:04,335 [shard 14] seastar - Exceptional future ignored: std::bad_alloc (std::bad_alloc), backtrace: 0x536e5b4 0x5050d02 0x4245b4b 0x484ab39 0x484a836 0x484b364 0x510b25f 0x510ef37 0x5152275 0x50acb4f /opt/redpanda/lib/libc.so.6+0x91016 /opt/redpanda/lib/libc.so.6+0x1166cf
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<void>, storage::disk_log_impl::new_segment(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> >, seastar::io_priority_class)::$_8, seastar::future<void> seastar::future<seastar::lw_shared_ptr<storage::segment> >::then_impl_nrvo<storage::disk_log_impl::new_segment(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> >, seastar::io_priority_class)::$_8, seastar::future<void> >(storage::disk_log_impl::new_segment(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> >, seastar::io_priority_class)::$_8&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, storage::disk_log_impl::new_segment(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> >, seastar::io_priority_class)::$_8&, seastar::future_state<seastar::lw_shared_ptr<storage::segment> >&&), seastar::lw_shared_ptr<storage::segment> >
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<void>, storage::disk_log_appender::operator()(model::record_batch&)::$_2::operator()() const::'lambda'(), seastar::future<void> seastar::future<void>::then_impl_nrvo<storage::disk_log_appender::operator()(model::record_batch&)::$_2::operator()() const::'lambda'(), seastar::future<void> >(storage::disk_log_appender::operator()(model::record_batch&)::$_2::operator()() const::'lambda'()&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, storage::disk_log_appender::operator()(model::record_batch&)::$_2::operator()() const::'lambda'()&, seastar::future_state<seastar::internal::monostate>&&), void>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::internal::do_until_state<storage::disk_log_appender::operator()(model::record_batch&)::$_3, storage::disk_log_appender::operator()(model::record_batch&)::$_2>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, storage::disk_log_appender::operator()(model::record_batch&)::$_4, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > seastar::future<void>::then_impl_nrvo<storage::disk_log_appender::operator()(model::record_batch&)::$_4, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(storage::disk_log_appender::operator()(model::record_batch&)::$_4&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, storage::disk_log_appender::operator()(model::record_batch&)::$_4&, seastar::future_state<seastar::internal::monostate>&&), void>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::handle_exception<storage::disk_log_appender::operator()(model::record_batch&)::$_5>(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&)::'lambda'(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&), seastar::futurize<storage::disk_log_appender::operator()(model::record_batch&)::$_5>::type seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::then_wrapped_nrvo<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::handle_exception<storage::disk_log_appender::operator()(model::record_batch&)::$_5>(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&)::'lambda'(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&)>(seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::handle_exception<storage::disk_log_appender::operator()(model::record_batch&)::$_5>(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&)::'lambda'(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&)&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::handle_exception<storage::disk_log_appender::operator()(model::record_batch&)::$_5>(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&)::'lambda'(storage::disk_log_appender::operator()(model::record_batch&)::$_5&&)&, 
seastar::future_state<seastar::bool_class<seastar::stop_iteration_tag> >&&), seastar::bool_class<seastar::stop_iteration_tag> >
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::internal::coroutine_traits_base<seastar::bool_class<seastar::stop_iteration_tag> >::promise_type
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::finally_body<auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)::operator()(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&) const::'lambda'(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&)::operator()(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&) const::'lambda'(), false>, seastar::futurize<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>::type seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::then_wrapped_nrvo<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::finally_body<auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > 
>)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)::operator()(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&) const::'lambda'(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&)::operator()(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&) const::'lambda'(), false> >(seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::finally_body<auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)::operator()(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&) const::'lambda'(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&)::operator()(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&) const::'lambda'(), false>&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> >::finally_body<auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, 
seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)::operator()(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&) const::'lambda'(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&)::operator()(seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >&) const::'lambda'(), false>&, seastar::future_state<seastar::bool_class<seastar::stop_iteration_tag> >&&), seastar::bool_class<seastar::stop_iteration_tag> >
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::internal::repeater<auto model::record_batch_reader::impl::do_action<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)&&)::'lambda'()>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<storage::append_result>, auto model::record_batch_reader::impl::do_action<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)&&)::'lambda0'(), auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, 
seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&) seastar::future<void>::then_impl_nrvo<auto model::record_batch_reader::impl::do_action<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)&&)::'lambda0'(), seastar::future<storage::append_result> 
>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&&)::'lambda'(seastar::internal::promise_base_with_type<storage::append_result>&&, auto model::record_batch_reader::impl::do_action<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, auto model::record_batch_reader::impl::do_for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&)&&)::'lambda0'()&, seastar::future_state<seastar::internal::monostate>&&), void>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::internal::do_with_state<std::__1::tuple<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>, seastar::future<storage::append_result> >
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<storage::append_result>, seastar::future<storage::append_result>::finally_body<auto model::record_batch_reader::for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) &&::'lambda'(), true>, seastar::futurize<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>::type seastar::future<storage::append_result>::then_wrapped_nrvo<seastar::future<storage::append_result>, seastar::future<storage::append_result>::finally_body<auto model::record_batch_reader::for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) &&::'lambda'(), true> >(seastar::future<storage::append_result>::finally_body<auto model::record_batch_reader::for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) &&::'lambda'(), true>&&)::'lambda'(seastar::internal::promise_base_with_type<storage::append_result>&&, 
seastar::future<storage::append_result>::finally_body<auto model::record_batch_reader::for_each_ref<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) &&::'lambda'(), true>&, seastar::future_state<storage::append_result>&&), storage::append_result>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > >, auto raft::details::for_each_ref_extract_configuration<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, model::record_batch_reader&&, raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> >&)::operator()(std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> >&)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer), seastar::future<std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > > seastar::future<storage::append_result>::then_impl_nrvo<auto raft::details::for_each_ref_extract_configuration<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, model::record_batch_reader&&, raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(std::__1::vector<raft::offset_configuration, 
std::__1::allocator<raft::offset_configuration> >&)::operator()(std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> >&)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer), seastar::future<std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > > >(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer&&)::'lambda'(seastar::internal::promise_base_with_type<std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > >&&, auto raft::details::for_each_ref_extract_configuration<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer>(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, model::record_batch_reader&&, raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >)::'lambda'(std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> >&)::operator()(std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> >&)::'lambda'(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::consumer)&, seastar::future_state<storage::append_result>&&), storage::append_result>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::internal::do_with_state<std::__1::tuple<std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > >, seastar::future<std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > > >
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<storage::append_result>, raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::$_32, seastar::future<storage::append_result> seastar::future<std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > >::then_impl_nrvo<raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::$_32, seastar::future<storage::append_result> >(raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::$_32&&)::'lambda'(seastar::internal::promise_base_with_type<storage::append_result>&&, raft::consensus::disk_append(model::record_batch_reader&&, seastar::bool_class<raft::update_last_quorum_index>)::$_32&, seastar::future_state<std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > >&&), std::__1::tuple<storage::append_result, std::__1::vector<raft::offset_configuration, std::__1::allocator<raft::offset_configuration> > > >
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<raft::append_entries_reply>, raft::consensus::do_append_entries(raft::append_entries_request&&)::$_21, seastar::future<raft::append_entries_reply> seastar::future<storage::append_result>::then_impl_nrvo<raft::consensus::do_append_entries(raft::append_entries_request&&)::$_21, seastar::future<raft::append_entries_reply> >(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_21&&)::'lambda'(seastar::internal::promise_base_with_type<raft::append_entries_reply>&&, raft::consensus::do_append_entries(raft::append_entries_request&&)::$_21&, seastar::future_state<storage::append_result>&&), storage::append_result>
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: --------
Nov 29 17:33:04 srv-01-03-405.iad1.trmr.io rpk[61594]: seastar::continuation<seastar::internal::promise_base_with_type<raft::append_entries_reply>, seastar::future<raft::append_entries_reply> seastar::future<raft::append_entries_reply>::handle_exception<raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62>(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&)::'lambda'(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&), seastar::futurize<raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62>::type seastar::future<raft::append_entries_reply>::then_wrapped_nrvo<seastar::future<raft::append_entries_reply>, seastar::future<raft::append_entries_reply> seastar::future<raft::append_entries_reply>::handle_exception<raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62>(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&)::'lambda'(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&)>(seastar::future<raft::append_entries_reply> seastar::future<raft::append_entries_reply>::handle_exception<raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62>(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&)::'lambda'(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&)&&)::'lambda'(seastar::internal::promise_base_with_type<raft::append_entries_reply>&&, seastar::future<raft::append_entries_reply> seastar::future<raft::append_entries_reply>::handle_exception<raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62>(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&)::'lambda'(raft::consensus::do_append_entries(raft::append_entries_request&&)::$_62&&)&, seastar::future_state<raft::append_entries_reply>&&), raft::append_entries_reply>
I wonder if it has anything to do with the sysctl.conf settings I have in place. Do you have a list of recommended settings and values? I didn't see anything in the documentation. I could blow mine away and try with default plus whatever recommended changes you have.
Node 1 crash:
Nov 29 19:46:04 srv-01-03-405.iad1.trmr.io rpk[37465]: WARN 2022-11-29 19:46:04,856 [shard 8] tx - [{kafka/rx-bidrequest-stream/9}] - rm_stm.cc:1476 - [pid: {producer_identity: id=557410, epoch=0}] firs sequence in session must be zero, current seq: 11
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: WARN 2022-11-29 19:46:05,098 [shard 8] tx - [{kafka/rx-bidrequest-stream/9}] - rm_stm.cc:1476 - [pid: {producer_identity: id=537077, epoch=0}] firs sequence in session must be zero, current seq: 179
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: WARN 2022-11-29 19:46:05,098 [shard 8] tx - [{kafka/rx-bidrequest-stream/9}] - rm_stm.cc:1476 - [pid: {producer_identity: id=547830, epoch=0}] firs sequence in session must be zero, current seq: 155
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: WARN 2022-11-29 19:46:05,098 [shard 8] tx - [{kafka/rx-bidrequest-stream/9}] - rm_stm.cc:1476 - [pid: {producer_identity: id=539740, epoch=0}] firs sequence in session must be zero, current seq: 186
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: WARN 2022-11-29 19:46:05,099 [shard 8] tx - [{kafka/rx-bidrequest-stream/9}] - rm_stm.cc:1476 - [pid: {producer_identity: id=558453, epoch=0}] firs sequence in session must be zero, current seq: 36
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: WARN 2022-11-29 19:46:05,099 [shard 8] tx - [{kafka/rx-bidrequest-stream/9}] - rm_stm.cc:1476 - [pid: {producer_identity: id=559676, epoch=0}] firs sequence in session must be zero, current seq: 70
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: WARN 2022-11-29 19:46:05,100 [shard 8] tx - [{kafka/rx-bidrequest-stream/9}] - rm_stm.cc:1476 - [pid: {producer_identity: id=557873, epoch=0}] firs sequence in session must be zero, current seq: 50
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: Segmentation fault on shard 8.
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: Backtrace:
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x50ed126
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x5150a86
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: /opt/redpanda/lib/libc.so.6+0x42abf
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x514f535
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x51153fa
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x1c6b3cc
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x4f6ee3a
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x4f6a56c
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x214295a
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x213fe50
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x213f982
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x1f5c6cf
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x42f9f92
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x433e0c4
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x42ed817
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x42ec8fb
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x420980b
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x295e538
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x2900b68
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x2906e80
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x510b25f
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x510ef37
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x5152275
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: 0x50acb4f
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: /opt/redpanda/lib/libc.so.6+0x91016
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io rpk[37465]: /opt/redpanda/lib/libc.so.6+0x1166cf
Nov 29 19:46:05 srv-01-03-405.iad1.trmr.io systemd[1]: redpanda.service: main process exited, code=killed, status=11/SEGV
@kargh, this is the same cluster as the one that saw the previous crashes, yes? I ask because we were wondering if something buggy/corrupt has gotten into the controller log somehow and is resulting in crashes.
Is there any other trigger here? Any app that could have read from earlier/start of offset, etc?
We might need to turn on more logs to chase it, is that something you'd be able to help with?
This is the same cluster of servers, yes. I blew it away (data and app) about 10 days ago and installed 22.3.3. Everything ran well until today. This morning I upgraded to 22.3.4, and that went well: no issues, clean logs. The problems started after we pushed out a release for our app, which recreates 15k new connections within a few minutes. We're seeing the same behavior with this version as we did with 22.2.7.
With 22.2.7 we started seeing the seastar "Exceptional future ignored: std::bad_alloc" log messages on Nov 15th. On Nov 17th we pushed out a new release of our app, and from that point forward RP wouldn't stay stable for more than a few hours before it would start crash looping. On Nov 19th I completely blew it away and started over. By that point it would crash, start, and crash again within seconds.
We did not have any releases last week.
I can enable any logging that you would like.
We have several apps that consume from the cluster and mirrormaker also consuming and producing to it. I can dig into that further if you would like.
This issue with all the seastar log messages (https://github.com/redpanda-data/redpanda/issues/7570#issuecomment-1331418489) is happening pretty frequently on all the nodes now. In some cases I have more seastar messages than anything else.
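To quantify how much the seastar/rpc messages dominate, a quick tally of the journalctl output by severity and logger can help. This is just a sketch; the regex assumes the line shape shown in the logs above (`rpk[pid]: LEVEL date time [shard N] logger - ...`):

```python
import re
from collections import Counter

# Matches the journald line shape seen in the logs above, e.g.
# "... rpk[38137]: INFO 2022-11-29 17:31:30,877 [shard 11] rpc - ..."
LINE_RE = re.compile(r"rpk\[\d+\]: (\w+) [\d-]+ [\d:,]+ \[shard (\d+)\] (\S+) -")

def tally(lines):
    """Count (severity, logger) pairs across journalctl lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            severity, _shard, logger = m.groups()
            counts[(severity, logger)] += 1
    return counts

# Two illustrative lines trimmed from the logs above:
sample = [
    "Nov 29 17:31:30 host rpk[38137]: INFO 2022-11-29 17:31:30,877 [shard 11] rpc - transport.cc:172 - Request timeout",
    "Nov 29 19:46:05 host rpk[37465]: WARN 2022-11-29 19:46:05,098 [shard 8] tx - rm_stm.cc:1476 - sequence warning",
]
print(tally(sample))  # counts keyed by (severity, logger)
```

Feeding `journalctl -u redpanda` through this per node would show whether the rpc/seastar noise is actually growing relative to everything else.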
@kargh would you be able to share the full logs (as an attachment) from the problematic nodes with crashes? Looks like there are multiple issues (bad_alloc and crash) and it is unclear if they are both related, full logs may help to stitch the timeline of events better.
--- for the record, here is the decoded backtrace from the node 1 crash:
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::'lambda'(seastar::frame)>(seastar::backtrace_buffer::append_backtrace()::'lambda'(seastar::frame)&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:59
seastar::print_with_backtrace(char const*, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:796
/opt/redpanda/lib/libc.so.6+0x42abf
std::__1::__deque_base<seastar::smp_message_queue::work_item*, std::__1::allocator<seastar::smp_message_queue::work_item*> >::end() at /vectorized/llvm/bin/../include/c++/v1/deque:1142
seastar::future<void> seastar::futurize<void>::invoke<seastar::smp_message_queue::submit_item(unsigned int, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::unique_ptr<seastar::smp_message_queue::work_item, std::__1::default_delete<seastar::smp_message_queue::work_item> >)::$_30, seastar::future<seastar::semaphore_units<seastar::named_semaphore_exception_factory, seastar::lowres_clock> > >(seastar::smp_message_queue::submit_item(unsigned int, std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::unique_ptr<seastar::smp_message_queue::work_item, std::__1::default_delete<seastar::smp_message_queue::work_item> >)::$_30&&, seastar::future<seastar::semaphore_units<seastar::named_semaphore_exception_factory, seastar::lowres_clock> >&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/core/future.hh:2146
seastar::futurize<std::__1::invoke_result<std::__1::unique_ptr<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, std::__1::default_delete<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> > > > >::type>::type seastar::smp_message_queue::submit<seastar::foreign_ptr<std::__1::unique_ptr<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, std::__1::default_delete<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> > > > >::destroy_on(std::__1::unique_ptr<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, std::__1::default_delete<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> > > >, unsigned int)::'lambda'()>(unsigned int, seastar::smp_submit_to_options, std::__1::unique_ptr<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, std::__1::default_delete<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> > > >&&) at /vectorized/include/seastar/core/smp.hh:269
decltype(auto) std::__1::__variant_detail::__visitation::__base::__visit_alt<std::__1::__variant_detail::__dtor<std::__1::__variant_detail::__traits<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, model::record_batch_reader::foreign_data_t>, (std::__1::__variant_detail::_Trait)1>::__destroy()::'lambda'(std::__1::__variant_detail::__traits<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, model::record_batch_reader::foreign_data_t>&), std::__1::__variant_detail::__dtor<std::__1::__variant_detail::__traits<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, model::record_batch_reader::foreign_data_t>, (std::__1::__variant_detail::_Trait)1>&>(std::__1::__variant_detail::__traits<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, model::record_batch_reader::foreign_data_t>&&, (std::__1::__variant_detail::_Trait)1&&...) at /vectorized/llvm/bin/../include/c++/v1/variant:499
std::__1::default_delete<model::record_batch_reader::impl>::operator()(model::record_batch_reader::impl*) const at /vectorized/llvm/bin/../include/c++/v1/__memory/unique_ptr.h:54
raft::replicate_batcher::do_cache(std::__1::optional<detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> > >, model::record_batch_reader, raft::consistency_level, std::__1::optional<std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> > >) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/raft/replicate_batcher.cc:111
raft::replicate_batcher::cache_and_wait_for_result(seastar::promise<void>, std::__1::optional<detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> > >, model::record_batch_reader, raft::consistency_level, std::__1::optional<std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> > >) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/raft/replicate_batcher.cc:59
raft::replicate_batcher::replicate(std::__1::optional<detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> > >, model::record_batch_reader, raft::consistency_level, std::__1::optional<std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> > >) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/raft/replicate_batcher.cc:40
raft::consensus::do_replicate(std::__1::optional<detail::base_named_type<long, model::model_raft_term_id_type, std::__1::integral_constant<bool, true> > >, model::record_batch_reader&&, raft::replicate_options) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/raft/consensus.cc:751
cluster::rm_stm::replicate_seq(model::batch_identity, model::record_batch_reader, raft::replicate_options, seastar::lw_shared_ptr<available_promise<> >) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/cluster/rm_stm.cc:1507
cluster::rm_stm::do_replicate(model::batch_identity, model::record_batch_reader, raft::replicate_options, seastar::lw_shared_ptr<available_promise<> >)::$_6::operator()(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/cluster/rm_stm.cc:1029
seastar::future<boost::outcome_v2::basic_result<cluster::kafka_result, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<cluster::kafka_result, std::__1::error_code, void> > > seastar::futurize<seastar::future<boost::outcome_v2::basic_result<cluster::kafka_result, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<cluster::kafka_result, std::__1::error_code, void> > > >::invoke<cluster::rm_stm::do_replicate(model::batch_identity, model::record_batch_reader, raft::replicate_options, seastar::lw_shared_ptr<available_promise<> >)::$_6, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock> >(cluster::rm_stm::do_replicate(model::batch_identity, model::record_batch_reader, raft::replicate_options, seastar::lw_shared_ptr<available_promise<> >)::$_6&&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::__1::chrono::steady_clock>&&) at /vectorized/include/seastar/core/future.hh:2149
cluster::rm_stm::replicate_in_stages(model::batch_identity, model::record_batch_reader, raft::replicate_options) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/cluster/rm_stm.cc:981
cluster::partition::replicate_in_stages(model::batch_identity, model::record_batch_reader&&, raft::replicate_options) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/cluster/partition.cc:249
kafka::replicated_partition::replicate(model::batch_identity, model::record_batch_reader&&, raft::replicate_options) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/kafka/server/replicated_partition.cc:263
kafka::partition_append(detail::base_named_type<int, model::model_partition_id_type, std::__1::integral_constant<bool, true> >, seastar::lw_shared_ptr<kafka::replicated_partition>, model::batch_identity, model::record_batch_reader, short, int, long, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >) at /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-04413b3a03dfa992b-1/redpanda/redpanda/src/v/kafka/server/handlers/produce.cc:186
decltype(static_cast<kafka::produce_topic_partition(kafka::produce_ctx&, kafka::topic_produce_data&, kafka::partition_produce_data&)::$_1>(fp)(static_cast<cluster::partition_manager&>(fp0))) std::__1::__invoke_constexpr<kafka::produce_topic_partition(kafka::produce_ctx&, kafka::topic_produce_data&, kafka::partition_produce_data&)::$_1, cluster::partition_manager&>(kafka::produce_topic_partition(kafka::produce_ctx&, kafka::topic_produce_data&, kafka::partition_produce_data&)::$_1&&, cluster::partition_manager&) at /vectorized/llvm/bin/../include/c++/v1/type_traits:3648
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2330
seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2906
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_95::operator()() const at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:4140
std::__1::__function::__value_func<void ()>::operator()() const at /vectorized/llvm/bin/../include/c++/v1/__functional/function.h:507
Thanks, Bharath. I missed bad_alloc. @kargh: you mentioned 10 nodes, but how much memory do you have on those nodes, and specifically on the ones that are spewing bad_alloc errors? You said the number of partitions/topics is low, but how many exactly? Something is spiking the memory usage, and once bad_alloc hits, the system becomes difficult to reason about.
cc: @travisdowns as well since we have some plans to make things a little more obvious in case of bad_alloc (basically crash rather than continue).
@piyushredpanda @travisdowns @bharathv All 10 nodes are identical server builds: same CPU, memory, disks, etc. Each server has 1 x 64-core EPYC, 256 GB memory, and 8 x 3.84 TB NVMe drives in RAID-10 (for data).
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 3193.635
BogoMIPS: 4890.28
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
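For context on the bad_alloc side: seastar pins memory per shard, so each shard's budget is roughly the memory given to Redpanda divided by the core count. A back-of-the-envelope sketch for the stated hardware; the OS reservation figure here is a hypothetical placeholder, and the real numbers depend on the memory/reserve-memory settings Redpanda was started with:

```python
# Rough per-shard memory budget on the stated hardware.
# Assumes one shard per core and an even split of whatever
# memory seastar is given (actual values depend on the
# memory flags passed to redpanda at startup).
total_gib = 256
os_reserve_gib = 8        # hypothetical OS/page-cache reservation
shards = 64               # one per core, per the lscpu output above

per_shard_gib = (total_gib - os_reserve_gib) / shards
print(f"~{per_shard_gib:.2f} GiB per shard")  # ~3.88 GiB
```

Since allocations cannot spill across shards, a single hot shard (like shard 8 in the crash above) can hit bad_alloc while the node as a whole still reports free memory.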
Here is a 7 day graph of memory usage for one of the nodes (I can grab screenshots of the rest of the nodes but they all look the same):
total used free shared buff/cache available
Mem: 251G 212G 37G 465M 1.8G 36G
Swap: 4.0G 2.0M 4.0G
The only stat I don't readily have is messages / second as a total or by topic.
@bharathv Here is the output from journalctl -u redpanda for each of the nodes in the cluster. Each one has had some issues. I did notice that the seastar messages go back further than I had thought, but the crashing didn't start until yesterday.
Keep in mind when viewing the logs that on Nov 19th I shut down all the nodes in the cluster, blew away all the data in /var/lib/redpanda/data/, uninstalled v22.2.7, installed v22.3.3, and brought the cluster back up. Some of the logs go back earlier than the 19th, so you'll likely see what I was dealing with prior to the complete meltdown.
Also, prior to Nov 19th I was using node0 to try installing 22.3.3; you'll likely see that in the log. I was never able to get it to start, and eventually I went back to 22.2.7.
On Tuesday, Nov 29th (the same day the issues started), I upgraded all the nodes from v22.3.3 to v22.3.4 in the morning. That upgrade appeared to go fine. A few hours later, when we restarted our app, is when the crashing issue started.
One other thing of note: on Monday, Nov 21st, I realized that I had forgotten to set 9 of the 10 nodes into production mode. I was in such a hurry Saturday evening that it never crossed my mind that I needed to do that on each node. I corrected that on Monday per this Slack Thread.
Question:
In the event that I was going to rebuild each node, I assume I would first decommission the node, then blow away /var/lib/redpanda/data/. If I'm not mistaken, I need to assign it a new node_id, but this cluster was built with redpanda.empty_seed_starts_cluster=false, so when I bring the node back online, would it automatically assign it a different node_id?
Also, when the node is up and joins the cluster, would it automatically create all the topics and partitions on that node? I assume so but I just want to make sure.
Once that is done, does it migrate partitions / existing data to the new node or just new messages?
I guess what I'm asking is if there is a way to rebuild a live cluster without losing data by doing a single node at a time.
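For a rolling rebuild like that, one way to know when it is safe to wipe a node is to watch its decommission progress through the Admin API (default port 9644). This is only a sketch: the exact /v1/brokers payload shape and the membership_status field values are assumptions to verify against your Redpanda version:

```python
import json
from urllib.request import urlopen

ADMIN = "http://localhost:9644"  # default Admin API port

def draining_brokers(brokers):
    """Return node_ids still draining partitions (field name assumed)."""
    return [b["node_id"] for b in brokers if b.get("membership_status") == "draining"]

# Example payload shape (assumed; check against your Admin API version):
sample = json.loads(
    '[{"node_id": 3, "membership_status": "draining"},'
    ' {"node_id": 4, "membership_status": "active"}]'
)
print(draining_brokers(sample))  # -> [3]

# Against a live node, something like:
#   brokers = json.load(urlopen(f"{ADMIN}/v1/brokers"))
#   print(draining_brokers(brokers))
```

The idea is to poll until the decommissioned node no longer appears as draining, meaning its partitions have been migrated to the remaining replicas, before wiping its data directory.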
I enabled all the loggers and captured this info surrounding one of the bad_alloc events. Not sure if it'll be useful in any way. loggers_all.txt
@kargh: We are releasing v22.3.5 with #7057 included. As I understand, the team and @patrickangeles are reaching out to you and will work with you to verify the fix/solution, as it is something that's best to collaborate on.
Version & Environment
Redpanda version: 22.3.4
What went wrong?
We pushed out a new release of our app to our backend servers, which make many connections to each Redpanda node. When we reload the app, it kills the old connections and creates new ones. It appears that there might be some sort of issue when many connections are closed and opened in a short period of time (but that is just a guess).
Redpanda started crashing and throwing all the nodes into maintenance state.
What should have happened instead?
Redpanda should not crash or throw pages of errors.
Additional information
JIRA Link: CORE-1098