Closed: raphaelsc closed this issue 1 month ago.
Logs? (Jenkins is playing hard to get, can't reach the above page).
node1-scylla.113233.1719945271.core.gz test_repair_kill_1_node2.log.log test_repair_kill_1_node1.log.log
(I couldn't see in the log when the coredump happens)
This issue seems to be a duplicate of #18123 - @tchaikov and @raphaelsc please leave only one of them.
As always, the test failed because of a coredump (which I didn't try to analyze)
thanks Nadav, closed #18123
INFO 2024-07-09 20:38:44,895 [shard 0:main] storage_service - Stop transport: starts
INFO 2024-07-09 20:38:44,895 [shard 0:main] migration_manager - stopping migration service
WARN 2024-07-09 20:38:44,895 [shard 1:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: Repair job aborted by user, job=8524dd09-82a4-42a0-a5dd-05d5dd109fe5, keyspace=ks, tables=["cf"]
INFO 2024-07-09 20:38:44,896 [shard 1:main] migration_manager - stopping migration service
INFO 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: stats: repair_reason=repair, keyspace=ks, tables=["cf"], ranges_nr=513, round_nr=1, round_nr_fast_path_already_synced=0, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=5, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.008494865 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={127.0.54.1: 0, 127.0.54.2: 324}, row_from_disk_nr={127.0.54.1: 0, 127.0.54.2: 1}, row_from_disk_bytes_per_sec={127.0.54.1: 0, 127.0.54.2: 0.036373794} MiB/s, row_from_disk_rows_per_sec={127.0.54.1: 0, 127.0.54.2: 117.71817} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
WARN 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 1 out of 513 ranges failed, keyspace=ks, tables=["cf"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A
INFO 2024-07-09 20:38:44,901 [shard 1:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: stats: repair_reason=repair, keyspace=ks, tables=["cf"], ranges_nr=513, round_nr=1, round_nr_fast_path_already_synced=1, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=5, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.007847108 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={127.0.54.1: 0, 127.0.54.2: 0}, row_from_disk_nr={127.0.54.1: 0, 127.0.54.2: 0}, row_from_disk_bytes_per_sec={127.0.54.1: 0, 127.0.54.2: 0} MiB/s, row_from_disk_rows_per_sec={127.0.54.1: 0, 127.0.54.2: 0} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
WARN 2024-07-09 20:38:44,901 [shard 1:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 0 out of 513 ranges failed, keyspace=ks, tables=["cf"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A
INFO 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: Started to shutdown off-strategy compaction updater
INFO 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: Finished to shutdown off-strategy compaction updater
WARN 2024-07-09 20:38:44,901 [shard 0:strm] repair - repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: user-requested repair failed: std::runtime_error (["shard 0: std::runtime_error (repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 1 out of 513 ranges failed, keyspace=ks, tables=[\"cf\"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)", "shard 1: std::runtime_error (repair[8524dd09-82a4-42a0-a5dd-05d5dd109fe5]: 0 out of 513 ranges failed, keyspace=ks, tables=[\"cf\"], repair_reason=repair, nodes_down_during_repair={}, aborted_by_user=true, failed_because=N/A)"])
INFO 2024-07-09 20:38:44,922 [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator ed3f880a-2162-4296-9bf0-1f4b2f93ca39
WARN 2024-07-09 20:38:44,926 [shard 0: gms] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-09 20:38:44,930 [shard 0: gms] raft - [ed3f880a-2162-4296-9bf0-1f4b2f93ca39] applier fiber stopped because state machine was aborted: abort requested
WARN 2024-07-09 20:38:45,529 [shard 0: gms] gossip - === Gossip round FAIL: seastar::abort_requested_exception (abort requested)
INFO 2024-07-09 20:38:45,529 [shard 0: gms] gossip - Gossip loop is not scheduled because it is disabled
WARN 2024-07-09 20:38:46,132 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:38:46,132 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:38:48,133 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:39:46,132 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
WARN 2024-07-09 20:40:46,133 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
It looks like migration_manager::drain() hangs.
@kbr-scylla this issue is killing us in CI, can you please look into this? I have an unproven suspicion that _group0_as.request_abort() is never called, and:
INFO 2024-07-09 20:38:44,922 [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator ed3f880a-2162-4296-9bf0-1f4b2f93ca39
may indicate that the topology_coordinator might be hung in add_entry.
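To illustrate the suspected mechanism with a stand-in (this is not the topology coordinator code; wait_until_aborted is purely hypothetical): anything waiting on an operation gated by an abort_source will wait indefinitely if request_abort() is never invoked on that source.

#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>

// Stand-in for a fiber blocked inside a long-running group 0 operation:
// unless request_abort() is called on `as`, this never returns during shutdown.
seastar::future<> wait_until_aborted(seastar::abort_source& as) {
    try {
        co_await seastar::sleep_abortable(std::chrono::hours(24), as);
    } catch (const seastar::sleep_aborted&) {
        // shutdown path: the abort source was fired
    }
}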
can you please look into this?
I can, but the more important question is: why has this issue been open for a month without being assigned?
Good question. Maybe because it wasn't labeled with the triage label until yesterday. I wonder if we can automate the process better. Cc @dani-tweig @mykaul
We have many such issues. Sometimes, when I see progress being made but no ownership, I let it continue (and this one still has the 'triage' label). I see little point in assigning an issue to someone if there is no progress on it - that is really what I'm looking at. I could assign sooner - will that make folks more accountable? We can try.
I'm trying to triage within a few days of getting assigned, and then after triaging:
But if nobody looks at it, if it doesn't even go to the board, then it doesn't even get a chance of being properly worked on in the next quarter, because during planning it is outside everyone's awareness.
@kbr-scylla - any updates on this one?
@mykaul I'm currently looking into it. The reproduction rate of "test_repair_kill_1" seems to be fairly low, but based on #19555 I'm currently trying to reproduce locally with "test_repair_kill_3", which looks like it might reproduce more often.
Will update the ticket with any further findings.
Some statistics about the failure rate here: https://github.com/scylladb/scylladb/pull/19555#issuecomment-2227723165
Findings so far:
I was first investigating the possibility of add_entry() being stuck, and added some extra logging, but it doesn't look like that is the culprit:
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator 816d7b84-5666-4571-85f6-5ed274d02436
INFO 2024-07-19 18:42:50,407 [shard 0: gms] group0_client - emaskovsky: add_entry(): _raft_gr.group0_with_timeouts().add_entry
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: add_entry: enter
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] an entry is submitted
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] an entry proceeds on a leader
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] adding a forwarded entry from 816d7b84-5666-4571-85f6-5ed274d02436
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] waiting for memory permit
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] adding entry after waiting for memory permit
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: enter
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: checkpoint 1
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: checkpoint 2
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: checkpoint 3
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - emaskovsky: fsm::add_entry: leaving
INFO 2024-07-19 18:42:50,407 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] add_entry_on_leader: done
WARN 2024-07-19 18:42:50,410 [shard 0: gms] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:50,412 [shard 0: gms] raft - [816d7b84-5666-4571-85f6-5ed274d02436] applier fiber stopped because state machine was aborted: abort requested
It looks like a new topology coordinator is being started for some reason ("Starting new topology coordinator"), but that doesn't seem to be an issue, as it exits right after ("applier fiber stopped ..."). The add_entry also doesn't seem to be looping or stuck and eventually exits (although the abort_source is not checked explicitly there; maybe checking it before performing the more complex workflows could improve things, but in general it doesn't look like the culprit).
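As an illustration of that last point, a minimal sketch of checking the abort source up front (not the actual raft code; the function name and plumbing are assumptions):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Hypothetical helper: bail out early if an abort was already requested,
// instead of starting the more expensive part of the add_entry workflow.
seastar::future<> add_entry_checked(seastar::abort_source& as) {
    as.check();  // throws seastar::abort_requested_exception if an abort was requested
    // ... proceed with the existing add_entry workflow ...
    co_return;
}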
However, what I can see is that the shutdown still seems to be hung somewhere, and in particular there are logs from "gossip" trying to apply state (I added logging to see which states it is actually trying to apply), which goes on for about 2 minutes until the node is apparently killed:
WARN 2024-07-19 18:42:51,114 [shard 0: gms] gossip - === Gossip round FAIL: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:51,114 [shard 0: gms] gossip - Gossip loop is not scheduled because it is disabled
INFO 2024-07-19 18:42:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {LOAD: Value(923992,40)}
WARN 2024-07-19 18:42:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {CACHE_HITRATES: Value(system_traces.sessions_time_idx:0.000000;system_traces.node_slow_log_time_idx:0.000000;system_traces.node_slow_log:0.000000;ks.cf:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_distributed.view_build_status:0.000000;system_distributed.service_levels:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;system_traces.events:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;,41)}
WARN 2024-07-19 18:42:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:42:53,616 [shard 0:main] gossip - emaskovsky: Applying states: {CACHE_HITRATES: Value(system_distributed.view_build_status:0.000000;system_distributed.service_levels:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;ks.cf:0.000000;system_distributed.cdc_generation_timestamps:0.000000;system_traces.sessions_time_idx:0.000000;system_traces.node_slow_log_time_idx:0.000000;system_traces.node_slow_log:0.000000;system_traces.events:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;,42)}
WARN 2024-07-19 18:42:53,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:43:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {LOAD: Value(923992,43)}
WARN 2024-07-19 18:43:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:44:51,616 [shard 0:main] gossip - emaskovsky: Applying states: {LOAD: Value(923992,44)}
WARN 2024-07-19 18:44:51,616 [shard 0:main] gossip - Fail to apply application_state: seastar::abort_requested_exception (abort requested)
But I'm not sure yet where these calls are coming from; I will investigate this further.
In the normal case we should see things as on the second node (which is not stuck):
WARN 2024-07-19 18:45:19,544 [shard 0: gms] gossip - === Gossip round FAIL: seastar::abort_requested_exception (abort requested)
INFO 2024-07-19 18:45:19,544 [shard 0: gms] gossip - Gossip loop is not scheduled because it is disabled
INFO 2024-07-19 18:45:20,718 [shard 0:main] gossip - Disable and wait for gossip loop started
INFO 2024-07-19 18:45:20,718 [shard 0:main] gossip - Gossip is now stopped
INFO 2024-07-19 18:45:20,718 [shard 0:main] storage_service - Stop transport: stop_gossiping done
INFO 2024-07-19 18:45:20,718 [shard 0:main] messaging_service - Shutting down nontls server
INFO 2024-07-19 18:45:20,718 [shard 0:main] messaging_service - Shutting down tls server
INFO 2024-07-19 18:45:20,718 [shard 0:main] messaging_service - Shutting down tls server - Done
From this it looks like the gossip protocol might be stuck on the leader node somehow - I'm not seeing the logs on node 1 for "Disable and wait for gossip loop started" and "Gossip is now stopped". It looks like the gossip subsystem stays up on node 1 for some reason and tries to continue doing stuff.
It looks like a new topology coordinator is being started for some reason ("Starting new topology coordinator"), but that doesn't seem to be an issue, as it exits right after ("applier fiber stopped ...").
But does stopping the applier fiber cause the topology coordinator fiber to stop too?
From this it looks like the gossip protocol might be stuck on the leader node somehow - I'm not seeing the logs on node 1 for "Disable and wait for gossip loop started" and "Gossip is now stopped". It looks like the gossip subsystem stays up on node 1 for some reason and tries to continue doing stuff.
Is gossiper::stop() even called though?
Not sure yet, but it might be that it is not.
I added the additional logs, and according to those it looks like the shutdown is always getting stuck on the call to _group0_barrier.join() - https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/433/artifact/logs-full.release.000/1721507983592_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_kill_3%5B33-100%5D/node1.log:
INFO 2024-07-20 20:37:12,127 [shard 0:main] storage_service - Stop transport: starts
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - stopping migration service
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - emaskovsky: drain(): checkpoint 1
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - emaskovsky: drain(): checkpoint 2
INFO 2024-07-20 20:37:12,127 [shard 0:main] migration_manager - emaskovsky: drain(): _group0_barrier.join()
INFO 2024-07-20 20:37:12,127 [shard 0:strm] migration_manager - emaskovsky: Waiting for group0 barrier - start
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - stopping migration service
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - emaskovsky: drain(): checkpoint 1
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - emaskovsky: drain(): checkpoint 2
INFO 2024-07-20 20:37:12,127 [shard 1:main] migration_manager - emaskovsky: drain(): _group0_barrier.join()
(If it completed, there would be another log line, emaskovsky: drain(): exit, as seen in the successful cases.)
That is at this place in the code: https://github.com/scylladb/scylladb/blob/358147959e3e59718c14c091a832c4cced1e6c83/service/migration_manager.cc#L94
future<> migration_manager::drain()
{
    mlogger.info("stopping migration service");
    _as.request_abort();
    co_await uninit_messaging_service();
    try {
        co_await coroutine::parallel_for_each(_schema_pulls, [] (auto&& e) {
            return e.second.join();
        });
    } catch (...) {
        mlogger.error("schema_pull failed: {}", std::current_exception());
    }
    co_await _group0_barrier.join(); // <--
    co_await _background_tasks.close();
}
It looks like it is trying to join some kind of barrier (maybe a read barrier?) but it gets stuck there.
I was first thinking that the _as not being propagated into some operations might be an issue - for example, migration_manager::get_schema_for_write() triggers the group barrier:
https://github.com/scylladb/scylladb/blob/358147959e3e59718c14c091a832c4cced1e6c83/service/migration_manager.cc#L1097
but it doesn't use the internal abort source _as, just the external one (as), so it should perhaps use a composite abort source (the internal abort source is checked at the beginning of the function, but there still seems to be a race). However, that doesn't seem to cause this particular issue, as there are no logs about get_schema_for_write() running at the same time.
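For reference, a minimal sketch of what such a composite abort source could look like (this is not existing ScyllaDB code; with_composite_abort is a hypothetical name, and it assumes Seastar's abort_source::subscribe() accepts a no-argument noexcept callback, which depends on the Seastar version in use):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Illustrative only: forward abort requests from both the internal (_as) and
// the external (as) abort source into one combined source. The subscription
// objects must stay alive for as long as the combined source is in use.
seastar::future<> with_composite_abort(seastar::abort_source& internal_as,
                                       seastar::abort_source& external_as) {
    seastar::abort_source combined;
    auto forward = [&combined] () noexcept {
        if (!combined.abort_requested()) {
            combined.request_abort();
        }
    };
    auto sub1 = internal_as.subscribe(forward);
    auto sub2 = external_as.subscribe(forward);
    // subscribe() returns a disengaged subscription if the source is already
    // aborted, so also check both sources up front.
    internal_as.check();
    external_as.check();
    // ... the combined source could now be passed on, e.g. to
    // get_group0_barrier().trigger(combined) ...
    co_return;
}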
It looks like it is trying to join some kind of barrier (maybe a read barrier?)
migration_manager::_group0_barrier is a serialized_action that batches calls: multiple requests to the barrier get turned into a single group 0 read barrier. Once the current barrier finishes and there are requests scheduled, the next barrier starts, satisfying those requests, and so on.
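To make the batching semantics concrete, a minimal usage sketch (illustrative only: the constructor-with-callback form and the header path are assumptions here; trigger(abort_source&) and join() are the calls quoted elsewhere in this thread):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include "utils/serialized_action.hh"

seastar::future<> barrier_batching_demo() {
    // Stand-in callback; in migration_manager this is where the actual
    // group 0 read barrier runs.
    serialized_action barrier([] {
        return seastar::make_ready_future<>();
    });
    seastar::abort_source as;
    // Two concurrent triggers are satisfied by a single run of the callback
    // (or by the immediately following one if a run is already in flight).
    auto f1 = barrier.trigger(as);
    auto f2 = barrier.trigger(as);
    co_await std::move(f1);
    co_await std::move(f2);
    // join() - the call made by migration_manager::drain() - resolves once no
    // run is in flight; this is where node1 appears to be stuck.
    co_await barrier.join();
}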
However, that doesn't seem to cause this particular issue, as there are no logs about get_schema_for_write() running at the same time.
Doesn't repair actually use this barrier?
Doesn't repair actually use this barrier?
Yes
repair/table_check.cc:
future<table_dropped> table_sync_and_check(replica::database& db, service::migration_manager& mm, const table_id& uuid) {
    if (mm.use_raft()) {
        // Trigger read barrier to synchronize schema.
        co_await mm.get_group0_barrier().trigger(mm.get_abort_source());
    }
    co_return !db.column_family_exists(uuid);
}
This could be our stuck call.
test produced a core: https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/1980/testReport/junit/repair_additional_test/TestRepairAdditional/FullDtest___full_split000___test_repair_kill_1/
dtest logs tell me shutdown is stuck:
2 minutes and a few seconds ago, I see this in node1 logs:
so shutdown is stuck when stopping the migration manager
maybe a dup of https://github.com/scylladb/scylladb/issues/18320