xemul opened 1 month ago
several nodes have this crash in their logs
service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}::operator()() const at ././service/topology_coordinator.cc:1872
Nothing unexpected here:
utils::get_local_injector().inject("crash_coordinator_before_stream", [] { abort(); });
test_kill_coordinator_during_op.1.debug.1.zip
node-4165 was trying to restart
It began restarting at 00:24:27:
WARN 2024-10-15 00:24:27,745 seastar - Seastar compiled with default allocator, --memory option won't take effect
The last log message was at 00:24:40:
INFO 2024-10-15 00:24:40,859 [shard 0:comp] compaction - [Compact system.peers b91439b0-8a72-11ef-834b-50ec3982def2] Compacted 4 sstables to [/scylladir/testlog/x86_64/debug/scylla-4165/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-3gkd_1nh4_4p3yo1zvma01fxspv6-big-Data.db:level=0]. 38kB to 9489 bytes (~24% of original) in 68ms = 562kB/s. ~512 total partitions merged to 5.
so it got stuck 13 seconds into the restart procedure.
The last message from storage_service is:
INFO 2024-10-15 00:24:38,026 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode
less than a second earlier it started reloading topology state:
DEBUG 2024-10-15 00:24:37,989 [shard 0: gms] raft_topology - reload raft topology state
it's unclear if this operation finished; did it get stuck here?
The next message after
INFO 2024-10-15 00:24:38,026 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode
should be "Performing gossip shadow round...". We can see an example of this in another node that restarted earlier:
INFO 2024-10-15 00:23:48,960 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode
INFO 2024-10-15 00:23:48,960 [shard 0:strm] storage_service - Performing gossip shadow round, initial_contact_nodes={127.153.192.27, 127.153.192.9, 127.153.192.17, 127.153.192.36}
so our node got stuck somewhere between these two messages.
In the code it's between
} else if (_group0->joined_group0()) {
// We are a part of group 0. The _topology_change_kind_enabled flag is maintained from there.
_manage_topology_change_kind_from_group0 = true;
set_topology_change_kind(upgrade_state_to_topology_op_kind(_topology_state_machine._topology.upgrade_state));
if (_db.local().get_config().force_gossip_topology_changes() && raft_topology_change_enabled()) {
throw std::runtime_error("Cannot force gossip topology changes - the cluster is using raft-based topology");
}
slogger.info("The node is already in group 0 and will restart in {} mode", raft_topology_change_enabled() ? "raft" : "legacy");
in join_cluster, and
} else {
auto local_features = _feature_service.supported_feature_set();
slogger.info("Performing gossip shadow round, initial_contact_nodes={}", initial_contact_nodes);
in join_topology.
The only place I see where it could get stuck is here:
auto tmlock = std::make_unique<token_metadata_lock>(co_await get_token_metadata_lock());
auto tmptr = co_await get_mutable_token_metadata_ptr();
indeed, it's plausible that it's getting stuck here, since it printed reload raft topology state just a second earlier; so it could be that it's stuck somewhere inside topology_state_load, which also takes the token_metadata lock. The question is where.
The new topology coordinator (scylla-4174) is not doing anything during this time.
DEBUG 2024-10-15 00:24:24,959 [shard 0: gms] raft_topology - topology coordinator fiber has nothing to do. Sleeping.
INFO 2024-10-15 00:24:37,950 [shard 0:main] raft_group_registry - marking Raft server 9211b09b-4889-4442-a2cc-c86847b791b6 as alive for raft groups
INFO 2024-10-15 00:25:23,770 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:25:53,771 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:25:53,771 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:26:23,771 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:26:23,771 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:26:53,771 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:26:53,771 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:27:23,772 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:27:23,772 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:27:53,772 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:27:53,772 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:28:23,772 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:28:23,773 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:28:53,773 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:28:53,773 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:29:23,773 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:29:23,773 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
it looks like we have some kind of deadlock inside topology_state_load. cc @gleb-cloudius
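To make the suspicion concrete, here is a standalone illustration of the suspected pattern, not Scylla code: the names are invented, the real code runs as Seastar fibers on one shard rather than OS threads, and the only point is that whoever holds the token_metadata lock never finishes, so the restarting node's join path can never acquire it:
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex token_metadata_lock;   // stands in for storage_service's token_metadata lock
std::condition_variable cv;
std::mutex cv_m;
bool never_signalled = false;     // the holder waits for something that never happens

// Stands in for the fiber that printed "reload raft topology state".
void topology_state_load() {
    std::lock_guard<std::mutex> guard(token_metadata_lock);
    std::unique_lock<std::mutex> lk(cv_m);
    cv.wait(lk, [] { return never_signalled; });   // stuck while still holding the lock
}

// Stands in for join_cluster after "The node is already in group 0 ...".
void join_cluster() {
    std::lock_guard<std::mutex> guard(token_metadata_lock);   // never granted
    std::cout << "Performing gossip shadow round...\n";       // never printed
}

int main() {
    std::thread a(topology_state_load);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::thread b(join_cluster);
    a.join();   // hangs forever, mirroring the stuck restart
    b.join();
}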
I sent https://github.com/scylladb/scylladb/pull/21247 to help the investigation further once the problem reproduces again. (I'm trying to reproduce locally, but as usual, it's not so easy)
Idea for further debugging:
Reproduced with a tactical sleep:
diff --git a/service/storage_service.cc b/service/storage_service.cc
index cf7d75b082..af2be00da2 100644
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -2997,6 +2997,10 @@ future<> storage_service::join_cluster(sharded<db::system_distributed_keyspace>&
co_await replicate_to_all_cores(std::move(tmptr));
}
+ slogger.info("Sleep before load peer features");
+ co_await utils::get_local_injector().inject("sleep_before_load_peer_features", std::chrono::seconds{3});
+ slogger.info("DONE Sleep before load peer features");
+
// Seeds are now only used as the initial contact point nodes. If the
// loaded_endpoints are empty which means this node is a completely new
// node, we use the nodes specified in seeds as the initial contact
diff --git a/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py b/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py
index b61b0337c0..265041ba95 100644
--- a/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py
+++ b/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py
@@ -99,6 +99,11 @@ async def test_kill_coordinator_during_op(manager: ManagerClient) -> None:
await manager.server_start(new_node.server_id,
expected_error="Startup failed: std::runtime_error")
await wait_new_coordinator_elected(manager, 4, time.time() + 60)
+ logger.info(f"Restarting {coordinator_host}")
+ await manager.server_stop(coordinator_host.server_id)
+ await manager.server_update_config(coordinator_host.server_id, 'error_injections_at_startup', [
+ "sleep_before_load_peer_features",
+ ])
await manager.server_restart(coordinator_host.server_id, wait_others=1)
await manager.servers_see_each_other(await manager.running_servers())
await check_token_ring_and_group0_consistency(manager)
The sleep ensures that the last reload raft topology state happens before storage_service - The node is already in group 0 and will restart in raft mode.
We're stuck in process_left_node -> remove_ip -> gossiper::force_remove_endpoint.
The node we're processing is the one that just failed to bootstrap.
@gleb-cloudius please continue this investigation. The problem is now 100% reproducible with that sleep I posted
several nodes have this crash in their logs
service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}::operator()() const at ././service/topology_coordinator.cc:1872
Nothing unexpected here:
utils::get_local_injector().inject("crash_coordinator_before_stream", [] { abort(); });
It should be exit(), not abort(). abort() just creates cores that waste disk space.
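For reference, the suggested tweak would be something like the following sketch (the exit status value here is an arbitrary choice):
-    utils::get_local_injector().inject("crash_coordinator_before_stream", [] { abort(); });
+    utils::get_local_injector().inject("crash_coordinator_before_stream", [] { exit(1); });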
The hang is in storage_service::on_remove, since at the time it is called the mode is still not set to raft on startup (because of the delay), and it tries to do something there with token metadata, which it should not touch since it is managed by raft at that point. The problem is with this upgrade code, which I cannot grasp at all. It sets the correct mode too late on a regular reboot. Why does it even try to figure something complicated out on reboot? It should be a straight read from a local table as early as possible.
@piodul
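For illustration only, a hypothetical sketch of the kind of guard being argued for; this is not the actual Scylla code and the signature is an assumption:
future<> storage_service::on_remove(gms::inet_address endpoint, gms::permit_id permit) {
    if (raft_topology_change_enabled()) {
        // Token metadata is owned by the raft topology state machine at this point,
        // so a gossip-driven removal should not touch it.
        co_return;
    }
    // ... legacy, gossip-based handling that updates token metadata ...
    co_return;
}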
It also looks like if (_db.local().get_config().load_ring_state() && !raft_topology_change_enabled()) { at the beginning of storage_service::join_cluster is always true (or rather, raft_topology_change_enabled there is always false), since we set the mode only later.
The intended purpose of the _manage_topology_change_kind_from_group0 was to determine who is in control of the _topology_change_kind_enabled flag:
* If it is set to true, then its value reflects the upgrade_state column in system.topology,
* If it is set to false, then it can be set manually by initial startup code.
_manage_topology_change_kind_from_group0 starts as false, but is set to true during startup only after we are sure that we joined group0 and we performed a read barrier (in case this is the first node, we also require that the initial state of topology was written to group0). The "manual mode" is only used if the node is joining the cluster, and is switched to the other mode after joining group 0.
I'm not 100% sure I remember the exact reason why I introduced _manage_topology_change_kind_from_group0, but I think it was about preventing the in-memory _topology_change_kind_enabled flag from being set to a wrong value. For example, when starting a single-node cluster it starts with an empty group0 state which is only initialized in raft_initialize_discovery_leader. A NULL value of upgrade_state is interpreted as "legacy topology", so the in-memory value could be temporarily wrong; in fact, there is a FIXME near the definition of _manage_topology_change_kind_from_group0 exactly about this.
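To make the ownership rule concrete, here is a small self-contained sketch of the semantics described above; the type and function names are invented and this is not the Scylla implementation:
#include <optional>

enum class topology_kind { legacy, raft };

struct topology_kind_tracker {
    // Mirrors _manage_topology_change_kind_from_group0 / _topology_change_kind_enabled.
    bool manage_from_group0 = false;
    topology_kind kind = topology_kind::legacy;

    // Manual override, honored only before group0 takes ownership of the flag.
    void set_manually(topology_kind k) {
        if (!manage_from_group0) { kind = k; }
    }

    // Called when group0 state is applied: a missing (NULL) upgrade_state reads as
    // "legacy", which is why the in-memory value can be temporarily wrong on a fresh
    // single-node cluster before raft_initialize_discovery_leader runs.
    void apply_group0_upgrade_state(std::optional<topology_kind> upgrade_state) {
        if (manage_from_group0) { kind = upgrade_state.value_or(topology_kind::legacy); }
    }
};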
But according to the bug here this is too late. If we are a part of group0, it means its apply can run before we set the variable (since we set it after starting the group), which in turn will execute the topology load code, which relies on knowing what mode we are in.
The question is: why don't we just read the current mode from the local table as early as possible on reboot, instead of trying to figure it out? Only if the mode is not there yet (as will be the case on a first boot) do we need to go through the trouble of figuring it out.
Yes, I think that setting the mode and the _manage_topology_change_kind_from_group0 flag in raft_group0::setup_group0_if_exist in case group 0 exists would work.
How? Group0 may exist but not manage topology. Maybe we can read upgrade_state from the topology table before even trying to start group0 and set the mode based on the result.
ping @gleb-cloudius what's the status of this issue?
Still waiting on @piodul to explain how it is all supposed to work. I do not see how what he proposes in https://github.com/scylladb/scylladb/issues/21114#issuecomment-2444580143 would work.
How? Group0 may exist but not manage topology. Maybe we can read upgrade_state from the topology table before even trying to start group0 and set the mode based on the result.
Yes, this should work. If bootstrap has completed, then the value of upgrade_state in the local system.topology table can be used for that purpose.
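A rough sketch of what that could look like; the helper name and the function itself are assumptions rather than the actual Scylla API:
// Hypothetical early initialization on reboot: read upgrade_state from the local
// system.topology table before group0 starts applying commands, so that
// topology_state_load already runs with the right mode.
future<> storage_service::init_topology_change_kind_early() {
    // Assumed helper: returns nullopt on a first boot, when system.topology is empty.
    auto upgrade_state = co_await load_local_topology_upgrade_state();
    if (upgrade_state) {
        set_topology_change_kind(upgrade_state_to_topology_op_kind(*upgrade_state));
        _manage_topology_change_kind_from_group0 = true;
        co_return;
    }
    // Otherwise fall back to the existing path that figures the mode out while joining.
}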
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/12400/
several nodes have this crash in their logs
decoded