Closed: raphaelsc closed this issue 1 month ago.
Logs? (Jenkins is playing hard to get, can't reach the above page).
node1-scylla.113233.1719945271.core.gz test_repair_kill_1_node2.log.log test_repair_kill_1_node1.log.log
(I couldn't see in the log when the coredump happens)
This issue seems to be a duplicate of #18123 - @tchaikov and @raphaelsc please leave only one of them.
As always, the test failed because of a coredump (which I didn't try to analyze)
thanks Nadav, closed #18123
INFO 2024-07-09 20:38:44,895 [shard 0:main] storage_service - Stop transport: starts
INFO 2024-07-09 20:38:44,895 [shard 0:main] migration_manager - stopping migration service
WARN 2024-07-09 20:38:44,895 [shard 1:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: Repair job aborted by user, job=8524dd09-82a4-42a0-a5dd-05d5dd109fe5, keyspace=ks, tables=["cf"]
INFO 2024-07-09 20:38:44,896 [shard 1:main] migration_manager - stopping migration service
INFO 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: stats: repair_reason=repair, keyspace=ks, tables=["cf"], ranges_nr=513, round_nr=1, round_nr_fast_path_already_synced=0, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=5, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.008494865 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={127.0.54.1: 0, 127.0.54.2: 324}, row_from_disk_nr={127.0.54.1: 0, 127.0.54.2: 1}, row_from_disk_bytes_per_sec={127.0.54.1: 0, 127.0.54.2: 0.036373794} MiB/s, row_from_disk_rows_per_sec={127.0.54.1: 0, 127.0.54.2: 117.71817} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
WARN 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 1 out of 513 ranges failed, keyspace=ks, tables=["cf"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A
INFO 2024-07-09 20:38:44,901 [shard 1:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: stats: repair_reason=repair, keyspace=ks, tables=["cf"], ranges_nr=513, round_nr=1, round_nr_fast_path_already_synced=1, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=5, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.007847108 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={127.0.54.1: 0, 127.0.54.2: 0}, row_from_disk_nr={127.0.54.1: 0, 127.0.54.2: 0}, row_from_disk_bytes_per_sec={127.0.54.1: 0, 127.0.54.2: 0} MiB/s, row_from_disk_rows_per_sec={127.0.54.1: 0, 127.0.54.2: 0} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
WARN 2024-07-09 20:38:44,901 [shard 1:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 0 out of 513 ranges failed, keyspace=ks, tables=["cf"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A
INFO 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: Started to shutdown off-strategy compaction updater
INFO 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: Finished to shutdown off-strategy compaction updater
WARN 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: user-requested repair failed: std::runtime_error (["shard 0: std::runtime_error (repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 1 out of 513 ranges failed, keyspace=ks, tables=[\"cf\"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)", "shard 1: std::runtime_error (repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 0 out of 513 ranges failed, keyspace=ks, tables=[\"cf\"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)"])
INFO 2024-07-09 20:38:44,922 [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator ed3f880a-2162-4296-9bf0-1f4b2f93ca39
WARN 2024-07-09 20:38:44,926 [shard 0: gms] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-09 20:38:44,930 [shard 0: gms] raft - [ed3f880a-2162-4296-9bf0-1f4b2f93ca39] applier fiber stopped because state machine was aborted: abort requested
WARN 2024-07-09 20:38:45,529 [shard 0: gms] gossip - === Gossip round FAIL: seastar::abort_requested_exception (abort requested)
INFO 2024-07-09 20:38:45,529 [shard 0: gms] gossip - Gossip loop is not scheduled because it is disabled
WARN 2024-07-09 20:38:46,132 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:38:46,132 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:38:48,133 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:39:46,132 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:40:46,133 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
It looks like migration_manager::drain() hangs.
@kbr-scylla this issue is killing us in CI, can you please look into this? I have an unproven suspicion that _group0_as.request_abort() is never called, and:
INFO 2024-07-09 20:38:44,922 [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator ed3f880a-2162-4296-9bf0-1f4b2f93ca39
may indicate that the topology_coordinator might be hung in add_entry.
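To illustrate the suspected mechanism with a stand-in (this is not the topology coordinator code; wait_until_aborted is purely hypothetical): anything waiting on an operation gated by an abort_source will wait indefinitely if request_abort() is never invoked on that source.

#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>

// Stand-in for a fiber blocked inside a long-running group 0 operation:
// unless request_abort() is called on `as`, this never returns during shutdown.
seastar::future<> wait_until_aborted(seastar::abort_source& as) {
    try {
        co_await seastar::sleep_abortable(std::chrono::hours(24), as);
    } catch (const seastar::sleep_aborted&) {
        // shutdown path: the abort source was fired
    }
}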
can you please look into this?
I can, but the more important question is: why has this issue been open for a month without being assigned?
Good question. Maybe because it wasn't labeled with the triage label until yesterday. I wonder if we can automate the process better. Cc @dani-tweig @mykaul
We have many such issues. Sometimes, when I see progress being made but no ownership, I let it continue (and this one still has the 'triage' label). I see little point in assigning an issue to someone if there is no progress on it - that is really what I'm looking at. I could assign sooner - will that make folks more accountable? We can try.
I'm trying to triage within a few days of getting assigned, and then after triaging:
But if nobody looks at it, if it doesn't even go to the board, then it doesn't even get a chance of being properly worked on in the next quarter, because during planning it is outside everyone's awareness.
@kbr-scylla - any updates on this one?
@mykaul I'm currently looking into it. The reproduction rate of "test_repair_kill_1" seems to be fairly low, but based on #19555 I'm currently trying to reproduce locally with "test_repair_kill_3", which looks like it might reproduce more often.
Will update the ticket with any further findings.
Some statistics about the failure rate here: https://github.com/scylladb/scylladb/pull/19555#issuecomment-2227723165
Findings so far:
I was first investigating the possibility of add_entry() being stuck, and added some extra logging, but it doesn't look like that is the culprit:
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator 816d7b84-5666-4571-85f6-5ed274d02436
INFO 2024-07-19 18:42:50,407 [shard 0: gms] group0_client - emaskovsky: add_entry(): _raft_gr.group0_with_timeouts().add_entry
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: add_entry: enter
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] an entry is submitted
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] an entry proceeds on a leader
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] adding a forwarded entry from 816d7b84-5666-4571-85f6-5ed274d02436
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] waiting for memory permit
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] adding entry after waiting for memory permit
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: enter
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: checkpoint 1
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: checkpoint 2
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: checkpoint 3
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: leaving
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] add_entry_on_leader: done
WARN 2024-07-19 18:42:50,410 [shard 0: gms] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:50,412 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] applier fiber stopped because state machine was aborted: abort requested
It looks like a new topology coordinator is being started for some reason ("Starting new topology coordinator"), but that doesn't seem to be an issue, as it exits right after ("applier fiber stopped ..."). The add_entry also doesn't seem to be looping or stuck and eventually exits (although the abort_source is not checked explicitly there; maybe checking it before performing the more complex workflows could improve things, but in general it doesn't look like the culprit).
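As an illustration of that last point, a minimal sketch of checking the abort source up front (not the actual raft code; the function name and plumbing are assumptions):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Hypothetical helper: bail out early if an abort was already requested,
// instead of starting the more expensive part of the add_entry workflow.
seastar::future<> add_entry_checked(seastar::abort_source& as) {
    as.check();  // throws seastar::abort_requested_exception if an abort was requested
    // ... proceed with the existing add_entry workflow ...
    co_return;
}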
However, what I can see is that the shutdown still seems to be hung somewhere, and in particular there are logs from "gossip" trying to apply state (I added logging to see which states it is actually trying to apply), which goes on for about 2 minutes until the node is apparently killed:
WARN 2024-07-19 18:42:51,114 [shard 0: gms] gossip - === Gossip round FAIL: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:51,114 [shard 0: gms] gossip - Gossip loop is not scheduled because it is disabled
INFO 2024-07-19 18:42:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {LOAD: Value(923992,40)}
WARN 2024-07-19 18:42:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {CACHE_HITRATES: Value(system_traces.sessions_time_idx:0.000000;system_traces.node_slow_log_time_idx:0.000000;system_traces.node_slow_log:0.000000;ks.cf:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_distributed.view_build_status:0.000000;system_distributed.service_levels:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;system_traces.events:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;,41)}
WARN 2024-07-19 18:42:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:53,616 [shard 0:main] gossip - emaskovsky: Applying states: {CACHE_HITRATES: Value(system_distributed.view_build_status:0.000000;system_distributed.service_levels:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;ks.cf:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_traces.sessions_time_idx:0.000000;system_traces.node_slow_log_time_idx:0.000000;system_traces.node_slow_log:0.000000;system_traces.events:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;,42)}
WARN 2024-07-19 18:42:53,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:43:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {LOAD: Value(923992,43)}
WARN 2024-07-19 18:43:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:44:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {LOAD: Value(923992,44)}
WARN 2024-07-19 18:44:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
But I'm not sure yet where these calls are coming from; I will investigate this further.
In the normal case we should see things as on the second node (which is not stuck):
WARN 2024-07-19 18:45:19,544 [shard 0: gms] gossip - === Gossip round FAIL: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:45:19,544 [shard 0: gms] gossip - Gossip loop is not scheduled because it is disabled
INFO 2024-07-19 18:45:20,718 [shard 0:main] gossip - Disable and wait for gossip loop started
INFO 2024-07-19 18:45:20,718 [shard 0:main] gossip - Gossip is now stopped
INFO 2024-07-19 18:45:20,718 [shard 0:main] storage_service - Stop transport: stop_gossiping done
INFO 2024-07-19 18:45:20,718 [shard 0:main] messaging_service - Shutting down nontls server
INFO 2024-07-19 18:45:20,718 [shard 0:main] messaging_service - Shutting down tls server
INFO 2024-07-19 18:45:20,718 [shard 0:main] messaging_service - Shutting down tls server - Done
From this it looks like the gossip protocol might be stuck on the leader node somehow - I'm not seeing the logs on node 1 for "Disable and wait for gossip loop started" and "Gossip is now stopped". It looks like the gossip subsystem stays up on node 1 for some reason and tries to continue doing stuff.
It looks like a new topology coordinator is being started for some reason ("Starting new topology coordinator"), but that doesn't seem to be an issue, as it exits right after ("applier fiber stopped ...").
But does stopping the applier fiber cause the topology coordinator fiber to stop too?
From this it looks like the gossip protocol might be stuck on the leader node somehow - I'm not seeing the logs on node 1 for "Disable and wait for gossip loop started" and "Gossip is now stopped". It looks like the gossip subsystem stays up on node 1 for some reason and tries to continue doing stuff.
Is gossiper::stop() even called though?
Not sure yet, but it might be that it is not.
I added the additional logs, and according to those it looks like the shutdown is always getting stuck on the call to _group0_barrier.join() - https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/433/artifact/logs-full.release.000/1721507983592_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_kill_3%5B33-100%5D/node1.log:
INFO 2024-07-20 20:37:12,127 [shard 0:main] storage_service - Stop transport: starts
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - stopping migration service
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - emaskovsky: drain(): checkpoint 1
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - emaskovsky: drain(): checkpoint 2
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - emaskovsky: drain(): _group0_barrier.join()
INFO 2024-07-20 20:37:12,127 [shard 0:strm] migration_manager - emaskovsky: Waiting for group0 barrier - start
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - stopping migration service
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - emaskovsky: drain(): checkpoint 1
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - emaskovsky: drain(): checkpoint 2
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - emaskovsky: drain(): _group0_barrier.join()
(If it completed, there would be another log line, emaskovsky: drain(): exit, as seen in the successful cases.)
That is at this place in the code: https://github.com/scylladb/scylladb/blob/358147959e3e59718c14c091a832c4cced1e6c83/service/migration_manager.cc#L94
future<> migration_manager::drain()
{
    mlogger.info("stopping migration service");
    _as.request_abort();
    co_await uninit_messaging_service();
    try {
        co_await coroutine::parallel_for_each(_schema_pulls, [] (auto&& e) {
            return e.second.join();
        });
    } catch (...) {
        mlogger.error("schema_pull failed: {}", std::current_exception());
    }
    co_await _group0_barrier.join(); // <--
    co_await _background_tasks.close();
}
It looks like it is trying to join some kind of barrier (maybe a read barrier?) but it gets stuck there.
I was first thinking that the _as not being propagated into some operations might be an issue - for example, migration_manager::get_schema_for_write() triggers the group barrier:
https://github.com/scylladb/scylladb/blob/358147959e3e59718c14c091a832c4cced1e6c83/service/migration_manager.cc#L1097
but it doesn't use the internal abort source _as, just the external one (as), so it should perhaps use a composite abort source (the internal abort source is checked at the beginning of the function, but there still seems to be a race). However, that doesn't seem to cause this particular issue, as there are no logs about get_schema_for_write() running at the same time.
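For reference, a minimal sketch of what such a composite abort source could look like (this is not existing ScyllaDB code; with_composite_abort is a hypothetical name, and it assumes Seastar's abort_source::subscribe() accepts a no-argument noexcept callback, which depends on the Seastar version in use):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Illustrative only: forward abort requests from both the internal (_as) and
// the external (as) abort source into one combined source. The subscription
// objects must stay alive for as long as the combined source is in use.
seastar::future<> with_composite_abort(seastar::abort_source& internal_as,
                                       seastar::abort_source& external_as) {
    seastar::abort_source combined;
    auto forward = [&combined] () noexcept {
        if (!combined.abort_requested()) {
            combined.request_abort();
        }
    };
    auto sub1 = internal_as.subscribe(forward);
    auto sub2 = external_as.subscribe(forward);
    // subscribe() returns a disengaged subscription if the source is already
    // aborted, so also check both sources up front.
    internal_as.check();
    external_as.check();
    // ... the combined source could now be passed on, e.g. to
    // get_group0_barrier().trigger(combined) ...
    co_return;
}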
It looks like it is trying to join some kind of barrier (maybe a read barrier?)
migration_manager::_group0_barrier is a serialized_action that batches calls: multiple requests to the barrier get turned into a single group 0 read barrier. Once the current barrier finishes and there are requests scheduled, the next barrier starts, satisfying those requests, and so on.
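To make the batching semantics concrete, a minimal usage sketch (illustrative only: the constructor-with-callback form and the header path are assumptions here; trigger(abort_source&) and join() are the calls quoted elsewhere in this thread):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include "utils/serialized_action.hh"

seastar::future<> barrier_batching_demo() {
    // Stand-in callback; in migration_manager this is where the actual
    // group 0 read barrier runs.
    serialized_action barrier([] {
        return seastar::make_ready_future<>();
    });
    seastar::abort_source as;
    // Two concurrent triggers are satisfied by a single run of the callback
    // (or by the immediately following one if a run is already in flight).
    auto f1 = barrier.trigger(as);
    auto f2 = barrier.trigger(as);
    co_await std::move(f1);
    co_await std::move(f2);
    // join() - the call made by migration_manager::drain() - resolves once no
    // run is in flight; this is where node1 appears to be stuck.
    co_await barrier.join();
}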
However, that doesn't seem to cause this particular issue, as there are no logs about get_schema_for_write() running at the same time.
Doesn't repair actually use this barrier?
Doesn't repair actually use this barrier?
Yes
repair/table_check.cc:
future<table_dropped> table_sync_and_check(replica::database& db, service::migration_manager& mm, const table_id& uuid) {
    if (mm.use_raft()) {
        // Trigger read barrier to synchronize schema.
        co_await mm.get_group0_barrier().trigger(mm.get_abort_source());
    }
    co_return !db.column_family_exists(uuid);
}
This could be our stuck call.
test produced a core: https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/1980/testReport/junit/repair_additional_test/TestRepairAdditional/FullDtest___full_split000___test_repair_kill_1/
dtest logs tell me shutdown is stuck:
2 minutes and a few seconds ago, I see this in node1 logs:
so shutdown is stuck when stopping the migration manager
maybe a dup of https://github.com/scylladb/scylladb/issues/18320