@asias is the "database - Replace address on first boot requested; this node is already bootstrapped" message expected?
Yes, it is expected and OK. We see this log when the replace_address_first_boot option is used to replace a node. When the replacing node boots again, it will not do a replace operation but a normal boot. With the replace_address option, the user has to remove the option after the replace operation is done.
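Roughly, the decision at boot looks like this (a simplified C++ sketch for illustration only, not the actual Scylla source):

#include <iostream>

// Simplified sketch: when replace_address_first_boot is set but the node has
// already bootstrapped, the replace request is ignored, the message above is
// printed, and a normal boot proceeds.
bool should_replace(bool replace_address_first_boot_set, bool already_bootstrapped) {
    if (replace_address_first_boot_set && already_bootstrapped) {
        std::cout << "Replace address on first boot requested; "
                     "this node is already bootstrapped\n";
        return false;          // fall back to a normal boot
    }
    return replace_address_first_boot_set;
}

int main() {
    // A replacing node booting again after the replace completed:
    should_replace(/*replace_address_first_boot_set=*/true,
                   /*already_bootstrapped=*/true);   // prints the message, boots normally
}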
@asias the node is being drained, not being started - why does it print this message during drain?
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - DRAINING: starting drain process
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - Thrift server stopped
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - CQL server stopped
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] gossip - My status = NORMAL
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] gossip - Announcing shutdown
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - Node 10.0.3.36 state jump to normal
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] database - Replace address on first boot requested; this node is already bootstrapped
Could it be that we are trying to drain a node that did not complete the bootstrap and as such ran into issues?
@bentsi can we get the complete log of this test?
@asias is there anything blocking drain before the node is fully up?
Job URL: https://jenkins.scylladb.com/view/scylla-3.1/job/scylla-3.1/job/longevity/job/longevity-2tb-4days-1Dis-2NonDis-Nemesis/10/consoleFull
Log: s3://cloudius-jenkins-test/6bc17d29-86a2-4897-92e8-33fed927fbe2/sct.log.gz
According to the log, the drain (2019-08-22 22:58:55,488) happened after the node was fully initialized (2019-08-22T22:24:56+00:00).
2449037 < t:2019-08-22 22:58:55,488 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 019-08-22T22:58:50+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] storage_service - DRAINING: starting drain process
$ cat ip-10-0-3-36.txt.2|grep 3.1.0.rc3-0.20190818.d06bcef3b
< t:2019-08-21 11:53:38,982 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-21T11:53:35+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b starting.
< t:2019-08-21 11:54:09,024 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-21T11:54:05+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b initialization completed.
< t:2019-08-22 12:34:43,926 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T12:34:41+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b starting.
< t:2019-08-22 12:37:23,138 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T12:37:11+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b starting.
< t:2019-08-22 15:19:03,457 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T15:18:59+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b initialization completed.
< t:2019-08-22 21:12:51,378 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T21:12:43+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b shutdown complete.
< t:2019-08-22 21:17:51,647 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T21:17:44+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b starting.
< t:2019-08-22 21:18:22,713 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T21:18:17+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b initialization completed.
< t:2019-08-22 21:50:57,354 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T21:50:43+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b starting.
< t:2019-08-22 21:51:28,454 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T21:51:17+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b initialization completed.
< t:2019-08-22 22:24:16,976 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T22:24:13+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b shutdown complete.
< t:2019-08-22 22:24:32,306 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T22:24:24+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b starting.
< t:2019-08-22 22:25:03,426 f:cluster.py l:1212 c:sdcm.cluster p:DEBUG > 2019-08-22T22:24:56+00:00 ip-10-0-3-36 !INFO | scylla.bin: [shard 0] init - Scylla version 3.1.0.rc3-0.20190818.d06bcef3b initialization completed.
I assume you are surprised by the 'Replace address on first boot requested; this node is already bootstrapped'. It is printed because after we see a node reach normal status, storage_service::handle_state_normal calls db().local().is_replacing(), which prints the log. This log is expected and harmless.
The other logs below are completely normal, because a drain operation was requested.
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - DRAINING: starting drain process
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - Thrift server stopped
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - CQL server stopped
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] gossip - My status = NORMAL
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] gossip - Announcing shutdown
Aug 22 22:58:50 ip-10-0-3-36.eu-west-1.compute.internal scylla.bin[8688]: [shard 0] storage_service - Node 10.0.3.36 state jump to normal
The drain API will call into storage_service::drain:
ss.set_mode(mode::DRAINING, "starting drain process", true);
ss.shutdown_client_servers().get();
gms::stop_gossiping().get();

ss.set_mode(mode::DRAINING, "shutting down messaging_service", false);
ss.do_stop_ms().get();
The next log message we are supposed to see is "shutting down messaging_service", but we did not. We do see that the CQL server is stopped ("storage_service - CQL server stopped"), which means ss.shutdown_client_servers().get() is done. So the drain probably hung inside gms::stop_gossiping. The only place inside gms::stop_gossiping where it could hang is timer_callback_lock().get(), which takes a semaphore.
@bentsi if this reproduces, let's do two things: try to ping and find a developer to access the machine;
if no one is available, create a coredump and upload it, and someone will try to analyze why we are stuck at this step.
added a coredump to related issue #4971
The issue got reproduced on 3.1.0.rc8-0.20190925.d7fc7bcf9
Created a coredump. Stack trace:
Oct 02 11:36:30 ip-10-0-162-150.eu-west-1.compute.internal systemd-coredump[37605]: Process 1758 (scylla) of user 996 dumped core.
Stack trace of thread 1758:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x0000000003fdbd87 _ZN7seastar12app_template14run_deprecatedEiPPcOSt8functionIFvvEE (scylla)
#7 0x0000000003fdd4d3 _ZN7seastar12app_template3runEiPPcOSt8functionIFNS_6futureIJiEEEvEE (scylla)
#8 0x00000000007da90a main (scylla)
#9 0x00007ff7abb67413 __libc_start_main (libc.so.6)
#10 0x000000000083ad9e _start (scylla)
Stack trace of thread 1774:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1773:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1777:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1778:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1780:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1776:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1775:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1783:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1782:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1763:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1772:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1781:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1779:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1760:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1761:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1766:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1768:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1785:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1759:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1784:
#0 0x00007ff7ac274df4 read (libpthread.so.0)
#1 0x00000000040728df _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2 0x0000000004073f22 operator() (scylla)
#3 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#4 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#5 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1762:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1769:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1767:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1771:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1770:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1765:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Stack trace of thread 1764:
#0 0x00007ff7abc3aefd syscall (libc.so.6)
#1 0x0000000004543507 _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2 0x0000000004191696 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3 0x0000000004192191 _ZN7seastar19reactor_backend_aio16wait_and_processEiPK10__sigset_t (scylla)
#4 0x0000000004072205 _ZN7seastar7reactor16wait_and_processEiPK10__sigset_t (scylla)
#5 0x0000000004153ac6 _ZN7seastar7reactor3runEv (scylla)
#6 0x000000000417dfcc _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7 0x000000000403fc8e _ZNKSt8functionIFvvEEclEv (scylla)
#8 0x00007ff7ac26b58e start_thread (libpthread.so.0)
#9 0x00007ff7abc406a3 __clone (libc.so.6)
Oct 02 11:36:32 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 2] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-36028797018983273.log. 11796472 bytes skipped.
Oct 02 11:36:33 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 12] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-216172782113803109.log. 3362808 bytes skipped.
Oct 02 11:36:36 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 0] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-19274.log. 27992056 bytes skipped.
Oct 02 11:36:40 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 6] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-108086391056911184.log. 798712 bytes skipped.
Oct 02 11:36:42 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 8] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-144115188075875263.log. 33312760 bytes skipped.
Oct 02 11:36:42 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 7] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-126100789566393224.log. 7131128 bytes skipped.
Oct 02 11:36:43 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 4] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-72057594037947260.log. 5128184 bytes skipped.
Oct 02 11:36:56 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 1] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-18014398509501292.log. 3522552 bytes skipped.
Oct 02 11:36:57 ip-10-0-162-150.eu-west-1.compute.internal scylla[37686]: [shard 13] commitlog_replayer - Corrupted file: /var/lib/scylla/commitlog/CommitLog-1-234187180623285143.log. 26365944 bytes skipped.
The strange thing is that after kill -3, the commitlog was corrupted.
Will upload the coredump soon
Coredump
gsutil cp gs://upload.scylladb.com/core.scylla.996.582fe616b75445a1b0e7218c09fbd66d.1758.1570016082000000.gz .
CC: @espindola this is probably related to the decommission issue that we are trying to reproduce
@bentsi where are the associated rpms (or relocatable packages)?
I am copying the core dump.
The core dump is with rc8?
I am starting to look at the core this morning.
I am uploading the core dump and binaries as an easy-to-use package to gs://scratch.scylladb.com/espindola/issue-4891/debug-4891.tar.xz. It should finish in a day, more or less :-(
Do we have the logs of what the node was doing before it got stuck?
I went down a rabbit hole to get a good debugging environment. Finally I managed to get all relevant debug info and pretty printers that work (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91997).
The first interesting thing I found is that in all io_queues found via seastar::local_engine->_io_queues, in all threads, _fq._requests_queued is zero. If I understand the code correctly, it means that there is no pending async kernel request, is that correct?
For sanity I checked that the queues in seastar::local_engine->_task_queues are not active. That is, _active = false, so there are no ready tasks that are somehow not being run.
I also checked that _requests_executing of every fair queue found via seastar::local_engine->_io_queues is 0.
Trying a bit of brute force. There are two vtables that match _ZTVN7seastar12continuation.gossiper.do_stop: 000000000492bef0 and 000000000492b978.
Running "scylla apply scylla find 0xaddr" for both found nothing.
Since gossiper::do_stop_gossiping uses a seastar thread, the next thing I wanted to do was check which seastar threads are currently active. It is possible to list the threads with "scylla threads", but trying to switch to one fails, since get_fs_base seems designed to work only on live processes.
Is there any way to print the seastar thread stacks in a core?
There are 373 vtables that match _ZTVN7seastar12continuation.storage_service or _ZTVN7seastar12continuation.gossiper. I will leave gdb searching for them overnight.
I got a failure from scylla-gdb.py. I will try to use the one from master. The error was:
Traceback (most recent call last):
File "~/scylla/scylla/scylla-gdb.py", line 1954, in invoke
File "~/scylla/scylla/scylla-gdb.py", line 1912, in find_in_live
gdb.error: Error occurred in Python: is_containing_page_free() takes exactly 1 argument (2 given)
Traceback (most recent call last):
File "~/scylla/scylla/scylla-gdb.py", line 1422, in invoke
gdb.error: Error occurred in Python: Error occurred in Python: is_containing_page_free() takes exactly 1 argument (2 given)
gdb-cmds:7: Error in sourced command file:
get-vtables2:52: Error in sourced command file:
Error occurred in Python: Error occurred in Python: Error occurred in Python: is_containing_page_free() takes exactly 1 argument (2 given)
using master's scylla-gdb.py also fails:
gdb.error: Error occurred in Python: Error occurred in Python: There is no member or method named _store.
I will try to backport just the fixes.
The version (3.1.0.rc8-0.20190925.d7fc7bcf9, d7fc7bcf9f306e093a955caee5edd6cf3c2e1232) doesn't include the fix for nodetool toppartitions (b62bb036ed6e235e340d2e9955dd13c0505ec0f3). @aleksbykov / @bentsi did we have a nodetool toppartitions nemesis earlier in this run, on the same node that did the drain?
OK, the brute force search for vtables found something:
00000000049e8a10 _ZTVN7seastar12continuationIZZNS_6futureIJEE9then_implIZZNS_5asyncIZZN7service15storage_service5drainEvENKUlRS6_E_clES7_EUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSC_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSD_DpOSG_ENUlRZNS4_IS9_JEEESO_SP_SQ_SS_E4workE_clESU_EUlvE0_S2_EET0_SQ_ENKUlvE_clEvEUlSQ_E_JEEE
Shard 0:
thread 1, small (size <= 80), live (0x600008151140 +0)
00000000049e8470 _ZTVN7seastar12continuationIZZNS_6futureIJEE17then_wrapped_implINS2_12finally_bodyIZZN7service15storage_service17run_with_api_lockIZNS6_5drainEvEUlRS6_E_EEDaNS_13basic_sstringIcjLj15ELb1EEEOT_ENUlS8_E_clES8_EUlvE_Lb0EEES2_EET0_SD_ENKUlvE_clEvEUlSD_E_JEEE
Shard 0:
thread 1, small (size <= 64), live (0x60000b9e9500 +0)
00000000049e7600 _ZTVN7seastar12continuationIZZNS_6futureIJEE9then_implIZN7service15storage_service10do_stop_msEvEUlvE0_S2_EET0_OT_ENKUlvE_clEvEUlS9_E_JEEE
Shard 0:
thread 1, small (size <= 64), live (0x600010555c80 +0)
00000000049e8e98 _ZTVN7seastar12continuationIZZNS_6futureIJEE17then_wrapped_implINS2_12finally_bodyIZNS_3smp9submit_toIZNS_7shardedIN7service15storage_serviceEE9invoke_onIZNS9_17run_with_api_lockIZNS9_5drainEvEUlRS9_E_EEDaNS_13basic_sstringIcjLj15ELb1EEEOT_EUlSD_E_S2_EET0_jSI_EUlvE_EENS_8futurizeINSt9result_ofIFSH_vEE4typeEE4typeEjSI_EUlvE_Lb0EEES2_EESK_SI_ENKUlvE_clEvEUlSI_E_JEEE
Shard 0:
thread 1, small (size <= 64), live (0x60000a746240 +0)
0000000004ad9c58 _ZTVN7seastar12continuationIZZNS_6futureIJEE9then_implIZZN3api19set_storage_serviceERNS4_12http_contextERNS_5httpd6routesEENKUlSt10unique_ptrINS7_7requestESt14default_deleteISB_EEE46_clESE_EUlvE_NS1_IJNS_4json16json_return_typeEEEEEET0_OT_ENKUlvE_clEvEUlSM_E_JEEE
Shard 0:
thread 1, small (size <= 112), live (0x60000c8b28c0 +0)
I would still like to know why they can't be found by looking at what is blocked in the _io_queues, but I will leave that for another time and continue with the above continuations.
I am rotating back to UDF for a couple of days. If anyone wants to take a look during that time, there is an easy to use package with the binaries and the core at gs://scratch.scylladb.com/espindola/issue-4891/debug-4891.tar.xz. Just start gdb with "gdb -x gdb-cmds".
I am uploading a new one to gs://scratch.scylladb.com/espindola/issue-4891/debug-4891-2.tar.zst that includes the debug info of all libraries and a fixed pretty printer. It should finish in about 1 day.
Probably the next step is to follow the continuation/promise chain starting with one of the continuations from the previous comment.
@espindola FYI, the log suggests it hangs inside gms::stop_gossiping. The only place inside gms::stop_gossiping where it could hang is timer_callback_lock().get(), which takes a semaphore. Is it possible to check if the code hangs waiting for the timer_callback_lock inside gms::stop_gossiping? The following patch in master is supposed to help debug and remove the possible unbalanced semaphore calls.
commit 19e8c14ad1a1c79a0432dd4b3b3b84c1ba36927b
Author: Asias He <asias@scylladb.com>
Date: Wed Sep 25 15:46:38 2019 +0800
gossiper: Improve the gossip timer callback lock handling (#5097)
- Update the outdated comments in do_stop_gossiping. It was
storage_service not storage_proxy that used the lock. More
importantly, storage_service does not use it any more.
- Drop the unused timer_callback_lock and timer_callback_unlock API
- Use with_semaphore to make sure the semaphore usage is balanced.
- Add log in gossiper::do_stop_gossiping when it tries to take the
semaphore to help debug hang during the shutdown.
Refs: #4891
Refs: #4971
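For illustration, here is a plain-C++ sketch (not the Seastar or gossiper code) of the unbalanced vs. with_semaphore-style balanced pattern the commit above describes:

#include <iostream>
#include <semaphore>

// Sketch only: if a unit taken by an earlier caller is never returned, the
// next acquire blocks forever, which is what a stop_gossiping() hung on
// timer_callback_lock() would look like.
std::binary_semaphore timer_callback_lock{1};

// Unbalanced pattern: an early return or exception skips the release().
void unbalanced_user() {
    timer_callback_lock.acquire();
    // ... work ...
    // timer_callback_lock.release();   // forgotten: the unit is lost
}

// with_semaphore-style pattern: the unit is always returned, even on exceptions.
template <typename Func>
void with_semaphore_like(std::binary_semaphore& sem, Func f) {
    sem.acquire();
    try { f(); } catch (...) { sem.release(); throw; }
    sem.release();
}

int main() {
    with_semaphore_like(timer_callback_lock, [] { std::cout << "balanced user done\n"; });
    unbalanced_user();
    if (!timer_callback_lock.try_acquire()) {
        // A real acquire() here would block forever -- the hung drain.
        std::cout << "unit never returned; stop_gossiping would hang here\n";
    }
}

With the balanced pattern plus the added log message, a reproduced hang should at least show whether the code is stuck waiting for this semaphore.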
@avikivity please consider backporting 19e8c14ad1a1c79a0432dd4b3b3b84c1ba36927b to 3.1.x
The first continuation (0x600008151140) is just
future<> storage_service::drain() {
    return run_with_api_lock(sstring("drain"), [] (storage_service& ss) {
        return seastar::async([&ss] { // this lambda
Which is just an extra confirmation that we are between "starting drain process" and "shutting down messaging_service".
The second continuation (0x60000b9e9500) is from inside the run_with_api_lock call in storage_service::drain.
The third continuation (0x600010555c80) is odd; it is the then in
future<> storage_service::do_stop_ms() {
    if (_ms_stopped) {
        return make_ready_future<>();
    }
    _ms_stopped = true;
    return netw::get_messaging_service().invoke_on_all([] (auto& ms) {
        return ms.stop();
    }).then([] { // this then
        slogger.info("messaging_service stopped");
    });
}
Not sure which call of do_stop_ms it is from.
The fourth continuation (0x60000a746240) is the invoke_on in
template <typename Func>
auto run_with_api_lock(sstring operation, Func&& func) {
    return get_storage_service().invoke_on(0, [operation = std::move(operation),
The fifth continuation (0x60000c8b28c0) is from within set_storage_service in api/storage_service.cc. I will follow the chain for 0x600010555c80 (the one in do_stop_ms).
The reference is from a struct that contains a promise:
0x60000b3cddc8: 0x1              0x0
0x60000b3cddd8: 0x0              0x0              // _future pointer
0x60000b3cdde8: 0x1 // _future_state (in future state)    0x600010555c90 // _state
0x60000b3cddf8: 0x600010555c80   // _task
Looks like the above structure is a parallel_for_each_state.
There is a then_wrapped_impl continuation at 0x600008b798c0 that refers to the parallel_for_each_state, which is likely the
f.then_wrapped([state] (future<> f) {
in parallel_for_each.
That continuation is referenced from 0x60000bc9bcb0, which is a smp_message_queue::async_work_item.
Thread 27 has a reference to that async_work_item. The reference is from 0x60700b3d2580, which is the continuation for the then_wrapped at
futurator::apply(this->_func).then_wrapped([this] (auto f) {
That is referenced from the continuation at 0x607000dcb1c0 which is from discard_result.
The discard_result continuation is referenced from 0x6070093a5900, which is a when_all_state, so I think this corresponds to
return when_all(stop_nontls_server(), stop_tls_server(), stop_client()).discard_result();
_nr_remain is 2, so stop_client has completed, but stop_tls_server is still pending.
The _continuation is 0x6070093a5950. It is referenced by a discard_result at 0x607000f65310, which is referenced by another when_all_state at 0x6070093a3100. This new when_all_state has _nr_remain = 3 and nr = 3.
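As an aside, here is a minimal sketch (not Seastar's actual when_all implementation) of the bookkeeping that makes these _nr_remain values meaningful: the counter starts at the number of sub-futures and is decremented as each one resolves, so _nr_remain = 2 of 3 above means exactly one sub-future had completed, while _nr_remain = 3 of 3 means none had.

#include <cstddef>
#include <functional>
#include <iostream>

// Sketch of a when_all-style state object: the combined result is delivered
// only when every sub-operation has resolved.
struct when_all_state_sketch {
    std::size_t nr_remain;
    std::function<void()> on_done;

    when_all_state_sketch(std::size_t nr, std::function<void()> done)
        : nr_remain(nr), on_done(std::move(done)) {}

    void one_resolved() {
        if (--nr_remain == 0) {
            on_done();           // all sub-operations finished
        }
    }
};

int main() {
    when_all_state_sketch state(3, [] { std::cout << "all sub-futures done\n"; });
    state.one_resolved();        // e.g. stop_client() finished
    std::cout << "remaining: " << state.nr_remain << "\n";  // prints 2: still blocked
}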
The _continuation of this new when_all_state is 0x6070093a3150; it is referenced by 0x607000f24f00, which is not virtual, but looks like a netw::messaging_service::rpc_protocol_server_wrapper. The value is
<seastar::rpc::protocol<netw::serializer, netw::messaging_verb>::server> = {
  <seastar::rpc::server> = {
    _proto = 0x607000766150,
    _ss = {
      _ssi = 0x6070018cc8c0,
      _aborted = true
    },
    _limits = {
      basic_request_size = 1000,
      bloat_factor = 3,
      max_memory = 684007096,
      isolate_connection = seastar::rpc::default_isolate_connection
    },
    _resources_available = {
      _count = 0,
      _ex = {_M_exception_object = 0x60700bfa8800},
      _wait_list = empty expiring_fifo
    },
    _conns = std::unordered_map with 0 elements,
    _ss_stopped = {
      _future = 0x6070093a3140,
      _local_state = {
        st = seastar::future_state_base::state::result,
      },
      _state = 0x607000f24fd8,
      _task = nullptr,
    },
    _reply_gate = {
      _count = 1,
      _stopped = {
        [contained value] = {
          _future = 0x0,
          _local_state = {
            st = seastar::future_state_base::state::future
          },
          _state = 0x6070093a3160,
          _task = 0x6070093a3150,
        }
      }
    },
    _options = {
      compressor_factory = 0x0,
      tcp_nodelay = true,
      streaming_domain = {
        [contained value] = {_id = 30668}
      },
      load_balancing_algorithm = seastar::server_socket::load_balancing_algorithm::port
    },
    _next_client_id = 142
  }
}
So we are blocked on the reply gate.
Searching for all continuations matching _ZTVN7seastar12continuation.*with_gate in thread 27 finds 0x6070011ccc00, which corresponds to the finally in with_gate in the call
with_gate(client->get_server().reply_gate(), [client, timeout, msg_id, data = std::move(data), permit = std::move(permit), &func] () mutable {
in recv_helper.
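To make the reply-gate hang concrete, here is a minimal single-threaded sketch (not seastar::gate itself) of the semantics: closing the gate only completes once every handler that entered has left, so one stuck RPC handler keeps the whole drain waiting.

#include <cstddef>
#include <functional>
#include <iostream>

// Sketch of gate semantics: every in-flight handler enters the gate, and
// close() resolves only after the last one leaves.
struct gate_sketch {
    std::size_t count = 0;
    bool closing = false;
    std::function<void()> closed_callback;

    void enter() { ++count; }
    void leave() {
        if (--count == 0 && closing) {
            closed_callback();   // the close() waiter finally resumes
        }
    }
    void close(std::function<void()> cb) {
        closing = true;
        closed_callback = std::move(cb);
        if (count == 0) {
            closed_callback();
        }
        // otherwise: wait for the last leave(); with one handler stuck,
        // that never happens, which matches the hung drain.
    }
};

int main() {
    gate_sketch reply_gate;
    reply_gate.enter();                          // an RPC handler starts (count == 1)
    reply_gate.close([] { std::cout << "gate closed, drain can proceed\n"; });
    // reply_gate.leave();                       // never called for the stuck reply
    std::cout << "still waiting, count = " << reply_gate.count << "\n";
}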
0x6070011ccc00 is referenced by 0x607000acad00, which I think is the continuation for the then_wrapped in
return apply(func, client->info(), timeout, WantClientInfo(), WantTimePoint(), signature(), std::move(args)).then_wrapped([client, timeout, msg_id, permit = std::move(permit)] (futurize_t
That in turn is used by 0x60700b15d140, which is a do_with_state. The do_with_state is referenced by 0x60700c63ddc0, which corresponds to
).then_wrapped([trace_state_ptr, reply_to, shard, response_id, &errors, &p] (future<std::tuple<future<>, future<>>>&& f) {
in storage_proxy::init_messaging_service
Next up is the when_all_state from
return when_all(
// mutate_locally() may throw, putting it into apply() converts exception to a future.
_nr_remain is 1, so it is the first future that is blocked. The _continuation is 0x6070001b7120, which is referenced by 0x60700c3987d0, which is the handle_exception:
}).handle_exception([reply_to, shard, &p, &errors] (std::exception_ptr eptr) {
0x60700c3987d0 is referenced by 0x607008e28540, which is probably
p->get_view_update_backlog()).then_wrapped([] (future<> f) {
It is referenced by 0x60700bee1ee0.
0x60700bee1ee0 is the then_wrapped in
return rpc_handler(rpc_client, std::forward
in send_message.
It is referenced by 0x6070004c6d20, which is the then in
return when_all(dst.send(std::move(data), timeout, cancel), wait_for_reply<Serializer>(wait(), timeout, cancel, dst, msg_id, sig)).then([] (auto r) {
in rpc_impl.hh. That is referenced by 0x60700a0141c0, which is a when_all_state.
_nr_remain is 1, so it is the dst.send that is blocked. The _continuation is 0x60700a014200, which is referenced by 0x607000e30480, which doesn't have a virtual table.
0x607000e30480 is a std::list node, and 0x607000e30490 is a seastar::rpc::connection::outgoing_entry.
The connection with the outgoing_entry is at 0x60700a499e00. _error is true, _connected is true.
@gleb-cloudius is it expected that a connection can have _error set to true but still be connected and have outgoing entries?
@avi, btw, this reminds me somewhat of issue #4971: shutdown is stuck sending a message.
I wonder if this could be caused by rpc use after free, similar to #5228
@xemul can you please check if rpc handlers are unregistered on this path, e.g. via storage_proxy::stop -> uninit_messaging_service?
They are, but the storage_proxy::stop() is not called by anybody %)
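For context, a rough sketch (hypothetical names, not the actual messaging_service/storage_proxy API) of why unregistering the RPC verb handlers on stop matters here: a handler that captures a raw pointer to a service outlives the service unless unregistration actually runs, and a late message then touches freed memory.

#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Toy verb registry: handlers stay registered until explicitly removed.
struct verb_registry {
    std::map<std::string, std::function<void()>> handlers;
    void register_verb(const std::string& name, std::function<void()> h) {
        handlers[name] = std::move(h);
    }
    void unregister_verb(const std::string& name) { handlers.erase(name); }
    void deliver(const std::string& name) {
        if (auto it = handlers.find(name); it != handlers.end()) it->second();
    }
};

struct proxy_like {
    int backlog = 42;
};

int main() {
    verb_registry ms;
    auto proxy = std::make_unique<proxy_like>();
    ms.register_verb("MUTATION", [p = proxy.get()] {
        std::cout << "backlog " << p->backlog << "\n";   // dangles if proxy is gone
    });

    // Correct shutdown order: unregister before destroying the service.
    ms.unregister_verb("MUTATION");
    proxy.reset();
    ms.deliver("MUTATION");      // safely a no-op; without the unregister it
                                 // would be a use-after-free
}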
Closing as a guess that this is related to #5228
Removed spurious backport candidate label.
Installation details
Scylla version (or git commit hash): 3.1.0.rc3-0.20190818.d06bcef3b
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-05a82ec4a4fd22be2
Events that happened before drain:
nodetool drain stuck
Db log
In the log above, after drain we see:
@avikivity can it be related to #4615?