Multiple `Reactor stalled` and `Exception when communicating` errors after c-s threads have started

Installation details

Kernel Version: 5.15.0-1020-aws Scylla version (or git commit hash): 2022.1.3-20220922.539a55e35 with build-id d1fb2faafd95058a04aad30b675ff7d2b930278d Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.1/relocatable/2022-09-22T13:36:03Z/scylla-enterprise-x86_64-package.tar.gz Cluster size: 6 nodes (i3.2xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: ami-0f6aebcffc8f7aa66 (aws: eu-west-1)

Test: scylla-cloud-longevity-cdc-100gb-4h Test id: 31c49a40-cd31-4214-b469-54011c1aaaf0 Test name: siren-tests/longevity-tests/scylla-cloud-longevity-cdc-100gb-4h Test config file(s):

longevity-cdc-100gb-4h.yaml

Issue description

By 2022-11-08 16:01:55.317 we started all the c-s threads.

Then we ran a series of nodetool cfstats and nodetool flush commands (no nemeses were started at that moment) cfstats for audit keyspace (+ flush) - succeeded

< t:2022-11-08 16:02:06,231 f:cluster.py      l:4099 c:sdcm.cluster         p:DEBUG > <cluster_cloud.ScyllaCloudCluster object at 0x7fc8d6fb6ef0>: Get cfstats on the node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 for audit keyspace
< t:2022-11-08 16:02:06,231 f:events_processes.py l:147  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20221108-150617-417068,id=0x7fc8d70747c0,default=True]
< t:2022-11-08 16:02:06,232 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u scylla -pw '***'  flush "...
< t:2022-11-08 16:02:06,233 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-11-08 16:02:06.231: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=262ff7fc-358c-40f3-9258-fc25c0d788db: nodetool_command=flush node=longevity-cdc-100gb-4h-master-d
b-node-31c49a40-1
< t:2022-11-08 16:02:06,233 f:grafana.py      l:80   c:sdcm.sct_events.grafana p:DEBUG > GrafanaEventAggregator start a new time window (90 sec)
< t:2022-11-08 16:02:08,275 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u scylla -pw '***'  flush " finished with status 0
< t:2022-11-08 16:02:08,276 f:cluster.py      l:2567 c:sdcm.cluster_baremetal p:DEBUG > Node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 [None | 172.23.40.169] (seed: True): Command '/usr/bin/nodetool -u scylla -pw '***'  flush ' duration -> 2.0436472590008
634 s

< t:2022-11-08 16:02:12,556 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u scylla -pw '***'  cfstats audit" finished with status 0

Then cfstats for _cdctest keyspace (+ flush). flush succeeded

< t:2022-11-08 16:02:12,556 f:cluster.py      l:4099 c:sdcm.cluster         p:DEBUG > <cluster_cloud.ScyllaCloudCluster object at 0x7fc8d6fb6ef0>: Get cfstats on the node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 for cdc_test keyspace
< t:2022-11-08 16:02:12,556 f:events_processes.py l:147  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20221108-150617-417068,id=0x7fc8d70747c0,default=True]
< t:2022-11-08 16:02:12,556 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u scylla -pw '***'  flush "...
< t:2022-11-08 16:02:12,558 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-11-08 16:02:12.556: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=f134e547-e49e-4374-bcfa-4007108cb0a9: nodetool_command=flush node=longevity-cdc-100gb-4h-master-d
b-node-31c49a40-1
< t:2022-11-08 16:02:14,060 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u scylla -pw '***'  flush " finished with status 0
< t:2022-11-08 16:02:14,060 f:cluster.py      l:2567 c:sdcm.cluster_baremetal p:DEBUG > Node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 [None | 172.23.40.169] (seed: True): Command '/usr/bin/nodetool -u scylla -pw '***'  flush ' duration -> 1.5034743630003
504 s

cfstats eventually failed:

< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Command: "/usr/bin/nodetool -u scylla -pw '***'  cfstats cdc_test"
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Stdout:
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Stderr:
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 589, in run
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR >     channel = self.open_channel()
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 687, in open_channel
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR >     raise OpenChannelTimeout(f'Failed to open channel in {timeout} seconds')
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR >

Approximately at this time we got Reactor stalled messages on several nodes:

node-4

2022-11-08 16:02:25.726 <2022-11-08 16:02:25.000>: (DatabaseLogEvent Severity.DEBUG) period_type=one-time event_id=8b77ec21-a230-4956-8f62-22496227d840: type=REACTOR_STALLED regex=Reactor stalled line_number=422 node=longevity-cdc-100gb-4h-master-db-node-31c49a40-4
2022-11-08T16:02:25+00:00 ip-172-23-42-118     !INFO | scylla[9952]: Reactor stalled for 367 ms on shard 0. Backtrace: 0x452bdc2 0x452aa70 0x452bcd0 0x7fa116c91a1f 0x15eb994 0x15f5037 0x14d63df 0x14d4b83 0x154f442 0x47ac801
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
(inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:764
(inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:783
seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1358
seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1100
(inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1117
(inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1341
?? ??:0
utils::uleb64_express_encode_impl(char*&, unsigned long, unsigned long) at ././utils/vle.hh:73
(inlined by) void utils::uleb64_express_encode<void (&)(char const*, unsigned long), void (&)(char const*, unsigned long)>(char*&, unsigned int, unsigned long, unsigned long, void (&)(char const*, unsigned long), void (&)(char const*, unsigned long)) at ././utils/vle.hh:82
(inlined by) logalloc::region_impl::object_descriptor::encode(char*&, unsigned long, unsigned long) const at ./utils/logalloc.cc:1222
(inlined by) logalloc::region_impl::alloc_small(logalloc::region_impl::object_descriptor const&, unsigned int, unsigned long) at ./utils/logalloc.cc:1309
logalloc::region_impl::alloc(migrate_fn_type const*, unsigned long, unsigned long) at ./utils/logalloc.cc:1655
rows_entry* allocation_strategy::construct<rows_entry, schema const&, position_in_partition_view&, seastar::bool_class<dummy_tag>, seastar::bool_class<continuous_tag> >(schema const&, position_in_partition_view&, seastar::bool_class<dummy_tag>&&, seastar::bool_class<continuous_tag>&&) at ././utils/allocation_strategy.hh:155
(inlined by) partition_snapshot_row_cursor::ensure_entry_if_complete(position_in_partition_view) at ././partition_snapshot_row_cursor.hh:604
operator() at ./partition_version.cc:448
(inlined by) _ZN8logalloc18allocating_section24with_reclaiming_disabledIRZZN15partition_entry19apply_to_incompleteERK6schemaOS2_R16mutation_cleanerRS0_RNS_6regionER13cache_trackermR27real_dirty_memory_accounterEN4$_12clEvEUlvE_EEDcSB_OT_ at ././utils/logalloc.hh:797
(inlined by) operator() at ././utils/logalloc.hh:819
(inlined by) _ZN8logalloc18allocating_section12with_reserveIZNS0_clIZZN15partition_entry19apply_to_incompleteERK6schemaOS3_R16mutation_cleanerRS0_RNS_6regionER13cache_trackermR27real_dirty_memory_accounterEN4$_12clEvEUlvE_EEDcSC_OT_EUlvE_EEDcSK_ at ././utils/logalloc.hh:768
(inlined by) _ZN8logalloc18allocating_sectionclIZZN15partition_entry19apply_to_incompleteERK6schemaOS2_R16mutation_cleanerRS0_RNS_6regionER13cache_trackermR27real_dirty_memory_accounterEN4$_12clEvEUlvE_EEDcSB_OT_ at ././utils/logalloc.hh:818
(inlined by) operator() at ./partition_version.cc:405
(inlined by) seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> ()>::indirect_vtable_for<partition_entry::apply_to_incomplete(schema const&, partition_entry&&, mutation_cleaner&, logalloc::allocating_section&, logalloc::region&, cache_tracker&, unsigned long, real_dirty_memory_accounter&)::$_12>::call(seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:153
seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> ()>::operator()() const at ././seastar/include/seastar/util/noncopyable_function.hh:209
(inlined by) utils::coroutine::run() at ././utils/coroutine.hh:40
(inlined by) operator() at ./row_cache.cc:979
(inlined by) _Z14with_allocatorIZZZN9row_cache9do_updateIZNS0_6updateENS0_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES2_S4_T_ENKUlvE_clEvENUlvE0_clEvEUlvE0_EDcR19allocation_strategyOS9_ at ././utils/allocation_strategy.hh:315
(inlined by) operator() at ./row_cache.cc:955
(inlined by) void std::__invoke_impl<void, seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(std::__invoke_other, seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:61
(inlined by) std::__invoke_result<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>::type std::__invoke<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&, (seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&)...) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:96
(inlined by) _ZSt12__apply_implIZZN9row_cache9do_updateIZNS0_6updateENS0_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES2_S4_T_ENKUlvE_clEvEUlvE0_St5tupleIJEEJEEDcOS9_OT0_St16integer_sequenceImJXspT1_EEE at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/tuple:1843
(inlined by) _ZSt5applyIZZN9row_cache9do_updateIZNS0_6updateENS0_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES2_S4_T_ENKUlvE_clEvEUlvE0_St5tupleIJEEEDcOS9_OT0_ at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/tuple:1854
(inlined by) seastar::future<void> seastar::futurize<void>::apply<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:2099
(inlined by) operator() at ././seastar/include/seastar/core/thread.hh:258
(inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>::type>::type seastar::async<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&, (seastar::futurize&&)...)::{lambda()#1}>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:124
seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:209
(inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:299

2022-11-08 16:02:50.003 <2022-11-08 16:02:29.000>: (DatabaseLogEvent Severity.DEBUG) period_type=one-time event_id=8b77ec21-a230-4956-8f62-22496227d840: type=REACTOR_STALLED regex=Reactor stalled line_number=435 node=longevity-cdc-100gb-4h-master-db-node-31c49a40-4
2022-11-08T16:02:29+00:00 ip-172-23-42-118     !INFO | scylla[9952]: Reactor stalled for 345 ms on shard 1. Backtrace: 0x452bdc2 0x452aa70 0x452bcd0 0x7fa116c91a1f 0x15eb994 0x15f5037 0x15150aa 0x1551094 0x154ea90 0x47ac801
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
(inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:764
(inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:783
seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1358
seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1100
(inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1117
(inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1341
?? ??:0
utils::uleb64_express_encode_impl(char*&, unsigned long, unsigned long) at ././utils/vle.hh:73
(inlined by) void utils::uleb64_express_encode<void (&)(char const*, unsigned long), void (&)(char const*, unsigned long)>(char*&, unsigned int, unsigned long, unsigned long, void (&)(char const*, unsigned long), void (&)(char const*, unsigned long)) at ././utils/vle.hh:82
(inlined by) logalloc::region_impl::object_descriptor::encode(char*&, unsigned long, unsigned long) const at ./utils/logalloc.cc:1222
(inlined by) logalloc::region_impl::alloc_small(logalloc::region_impl::object_descriptor const&, unsigned int, unsigned long) at ./utils/logalloc.cc:1309
logalloc::region_impl::alloc(migrate_fn_type const*, unsigned long, unsigned long) at ./utils/logalloc.cc:1655
bplus::node<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>* allocation_strategy::construct<bplus::node<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>>() at ././utils/allocation_strategy.hh:155
(inlined by) bplus::node<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::create() at ././utils/bptree.hh:1381
(inlined by) bplus::node<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::prealloc::push() at ././utils/bptree.hh:1852
(inlined by) bplus::node<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::insert(unsigned long, long, bplus::data<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>*, dht::raw_token_less_comparator) at ././utils/bptree.hh:1638
bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator::emplace_before<bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator::emplace_before<cache_entry::evictable_tag, seastar::lw_shared_ptr<schema const>&, dht::decorated_key, partition_entry>(long, dht::raw_token_less_comparator, cache_entry::evictable_tag&&, seastar::lw_shared_ptr<schema const>&, dht::decorated_key&&, partition_entry&&)::{lambda(bplus::data<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>*)#1}, cache_entry::evictable_tag, seastar::lw_shared_ptr<schema const>&, dht::decorated_key, partition_entry>(bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator::emplace_before<cache_entry::evictable_tag, seastar::lw_shared_ptr<schema const>&, dht::decorated_key, partition_entry>(long, dht::raw_token_less_comparator, cache_entry::evictable_tag&&, seastar::lw_shared_ptr<schema const>&, dht::decorated_key&&, partition_entry&&)::{lambda(bplus::data<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>*)#1}, dht::raw_token_less_comparator, cache_entry::evictable_tag&&, seastar::lw_shared_ptr<schema const>&, dht::decorated_key&&, partition_entry&&) at ././utils/bptree.hh:740
(inlined by) bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator bplus::tree<long, intrusive_array<cache_entry>, dht::raw_token_less_comparator, 16ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator::emplace_before<cache_entry::evictable_tag, seastar::lw_shared_ptr<schema const>&, dht::decorated_key, partition_entry>(long, dht::raw_token_less_comparator, cache_entry::evictable_tag&&, seastar::lw_shared_ptr<schema const>&, dht::decorated_key&&, partition_entry&&) at ././utils/bptree.hh:752
(inlined by) double_decker<long, cache_entry, dht::raw_token_less_comparator, dht::ring_position_comparator, 16, (bplus::key_search)0, (bplus::with_debug)0>::iterator double_decker<long, cache_entry, dht::raw_token_less_comparator, dht::ring_position_comparator, 16, (bplus::key_search)0, (bplus::with_debug)0>::emplace_before<cache_entry::evictable_tag, seastar::lw_shared_ptr<schema const>&, dht::decorated_key, partition_entry>(double_decker<long, cache_entry, dht::raw_token_less_comparator, dht::ring_position_comparator, 16, (bplus::key_search)0, (bplus::with_debug)0>::iterator, long, double_decker<long, cache_entry, dht::raw_token_less_comparator, dht::ring_position_comparator, 16, (bplus::key_search)0, (bplus::with_debug)0>::bound_hint const&, cache_entry::evictable_tag&&, seastar::lw_shared_ptr<schema const>&, dht::decorated_key&&, partition_entry&&) at ././utils/double-decker.hh:211
operator() at ./row_cache.cc:1033
(inlined by) operator() at ./row_cache.cc:973
(inlined by) _ZN8logalloc18allocating_section24with_reclaiming_disabledIRZZZZN9row_cache9do_updateIZNS2_6updateENS2_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES4_S6_T_ENKUlvE_clEvENUlvE0_clEvENKUlvE0_clEvEUlvE_EEDcRNS_6regionEOSB_ at ././utils/logalloc.hh:797
(inlined by) operator() at ././utils/logalloc.hh:819
(inlined by) _ZN8logalloc18allocating_section12with_reserveIZNS0_clIZZZZN9row_cache9do_updateIZNS3_6updateENS3_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES5_S7_T_ENKUlvE_clEvENUlvE0_clEvENKUlvE0_clEvEUlvE_EEDcRNS_6regionEOSC_EUlvE_EEDcSJ_ at ././utils/logalloc.hh:768
(inlined by) _ZN8logalloc18allocating_sectionclIZZZZN9row_cache9do_updateIZNS2_6updateENS2_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES4_S6_T_ENKUlvE_clEvENUlvE0_clEvENKUlvE0_clEvEUlvE_EEDcRNS_6regionEOSB_ at ././utils/logalloc.hh:818
(inlined by) operator() at ./row_cache.cc:968
(inlined by) _Z14with_allocatorIZZZN9row_cache9do_updateIZNS0_6updateENS0_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES2_S4_T_ENKUlvE_clEvENUlvE0_clEvEUlvE0_EDcR19allocation_strategyOS9_ at ././utils/allocation_strategy.hh:315
(inlined by) operator() at ./row_cache.cc:955
(inlined by) void std::__invoke_impl<void, seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(std::__invoke_other, seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:61
(inlined by) std::__invoke_result<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>::type std::__invoke<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&, (seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&)...) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:96
(inlined by) _ZSt12__apply_implIZZN9row_cache9do_updateIZNS0_6updateENS0_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES2_S4_T_ENKUlvE_clEvEUlvE0_St5tupleIJEEJEEDcOS9_OT0_St16integer_sequenceImJXspT1_EEE at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/tuple:1843
(inlined by) _ZSt5applyIZZN9row_cache9do_updateIZNS0_6updateENS0_16external_updaterER8memtableE4$_20EEN7seastar6futureIvEES2_S4_T_ENKUlvE_clEvEUlvE0_St5tupleIJEEEDcOS9_OT0_ at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/tuple:1854
(inlined by) seastar::future<void> seastar::futurize<void>::apply<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:2099
(inlined by) operator() at ././seastar/include/seastar/core/thread.hh:258
(inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>::type>::type seastar::async<seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, seastar::future<void> row_cache::do_update<row_cache::update(row_cache::external_updater, memtable&)::$_20>(row_cache::external_updater, memtable&, row_cache::update(row_cache::external_updater, memtable&)::$_20)::{lambda()#1}::operator()() const::{lambda()#2}&&, (seastar::futurize&&)...)::{lambda()#1}>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:124
seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:209
(inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:299

node-5: node_5_reactor_stalled.txt
node-1: node_1_reactor_stalled.txt

Also, a lot of the following error messages were found in the node logs:

2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)

2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)

Since nodetool command failed we triggered the coredump generation for scylla on node-1:

< t:2022-11-08 16:04:16,054 f:cluster.py      l:2472 c:sdcm.cluster_baremetal p:INFO  > Node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 [None | 172.23.40.169] (seed: True): Generate scylla core
< t:2022-11-08 16:04:16,055 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo pkill -f --signal 3 /usr/bin/scylla"...

Coredump: node_1_scylla_coredump.txt

Restore Monitor Stack command: $ hydra investigate show-monitor 31c49a40-cd31-4214-b469-54011c1aaaf0
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 31c49a40-cd31-4214-b469-54011c1aaaf0

Logs:

db-cluster-31c49a40.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/31c49a40-cd31-4214-b469-54011c1aaaf0/20221108_161719/db-cluster-31c49a40.tar.gz
monitor-set-31c49a40.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/31c49a40-cd31-4214-b469-54011c1aaaf0/20221108_161719/monitor-set-31c49a40.tar.gz
loader-set-31c49a40.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/31c49a40-cd31-4214-b469-54011c1aaaf0/20221108_161719/loader-set-31c49a40.tar.gz
sct-runner-31c49a40.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/31c49a40-cd31-4214-b469-54011c1aaaf0/20221108_161719/sct-runner-31c49a40.tar.gz

Jenkins job URL

scylladb / scylladb