xemul opened 1 month ago
several nodes have this crash in their logs
service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}::operator()() const at ././service/topology_coordinator.cc:1872
Nothing unexpected here:
utils::get_local_injector().inject("crash_coordinator_before_stream", [] { abort(); });
test_kill_coordinator_during_op.1.debug.1.zip
node-4165 was trying to restart
It began restarting at 00:24:27:
WARN 2024-10-15 00:24:27,745 seastar - Seastar compiled with default allocator, --memory option won't take effect
The last log message was at 00:24:40:
INFO 2024-10-15 00:24:40,859 [shard 0:comp] compaction - [Compact system.peers b91439b0-8a72-11ef-834b-50ec3982def2] Compacted 4 sstables to [/scylladir/testlog/x86_64/debug/scylla-4165/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-3gkd_1nh4_4p3yo1zvma01fxspv6-big-Data.db:level=0]. 38kB to 9489 bytes (~24% of original) in 68ms = 562kB/s. ~512 total partitions merged to 5.
so it got stuck 13 seconds into the restart procedure.
The last message from storage_service is:
INFO 2024-10-15 00:24:38,026 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode
less than a second earlier it started reloading topology state:
DEBUG 2024-10-15 00:24:37,989 [shard 0: gms] raft_topology - reload raft topology state
it's unclear if this operation finished; did it get stuck here?
The next message after
INFO 2024-10-15 00:24:38,026 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode
should be "Performing gossip shadow round...". We can see an example of this in another node that restarted earlier:
INFO 2024-10-15 00:23:48,960 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode
INFO 2024-10-15 00:23:48,960 [shard 0:strm] storage_service - Performing gossip shadow round, initial_contact_nodes={127.153.192.27, 127.153.192.9, 127.153.192.17, 127.153.192.36}
so our node got stuck somewhere between these two messages.
In the code it's between
} else if (_group0->joined_group0()) {
// We are a part of group 0. The _topology_change_kind_enabled flag is maintained from there.
_manage_topology_change_kind_from_group0 = true;
set_topology_change_kind(upgrade_state_to_topology_op_kind(_topology_state_machine._topology.upgrade_state));
if (_db.local().get_config().force_gossip_topology_changes() && raft_topology_change_enabled()) {
throw std::runtime_error("Cannot force gossip topology changes - the cluster is using raft-based topology");
}
slogger.info("The node is already in group 0 and will restart in {} mode", raft_topology_change_enabled() ? "raft" : "legacy");
in join_cluster, and
} else {
auto local_features = _feature_service.supported_feature_set();
slogger.info("Performing gossip shadow round, initial_contact_nodes={}", initial_contact_nodes);
in join_topology.
The only place I see where it could get stuck is here:
auto tmlock = std::make_unique<token_metadata_lock>(co_await get_token_metadata_lock());
auto tmptr = co_await get_mutable_token_metadata_ptr();
indeed, it's plausible that it's getting stuck here, since it printed reload raft topology state just a second earlier; so it could be that it's stuck somewhere inside topology_state_load, which also takes the token_metadata lock. The question is where.
The new topology coordinator (scylla-4174) is not doing anything during this time.
DEBUG 2024-10-15 00:24:24,959 [shard 0: gms] raft_topology - topology coordinator fiber has nothing to do. Sleeping.
INFO 2024-10-15 00:24:37,950 [shard 0:main] raft_group_registry - marking Raft server 9211b09b-4889-4442-a2cc-c86847b791b6 as alive for raft groups
INFO 2024-10-15 00:25:23,770 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:25:53,771 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:25:53,771 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:26:23,771 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:26:23,771 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:26:53,771 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:26:53,771 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:27:23,772 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:27:23,772 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:27:53,772 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:27:53,772 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:28:23,772 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:28:23,773 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:28:53,773 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:28:53,773 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
DEBUG 2024-10-15 00:29:23,773 [shard 0: gms] raft_topology - raft topology: Tablet load stats refresher aborted
INFO 2024-10-15 00:29:23,773 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC datacenter1 that has 5 token owners
it looks like we have some kind of deadlock inside topology_state_load. cc @gleb-cloudius
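To make the suspicion concrete, here is a standalone illustration of the suspected pattern, not Scylla code: the names are invented, the real code runs as Seastar fibers on one shard rather than OS threads, and the only point is that whoever holds the token_metadata lock never finishes, so the restarting node's join path can never acquire it:
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex token_metadata_lock;   // stands in for storage_service's token_metadata lock
std::condition_variable cv;
std::mutex cv_m;
bool never_signalled = false;     // the holder waits for something that never happens

// Stands in for the fiber that printed "reload raft topology state".
void topology_state_load() {
    std::lock_guard<std::mutex> guard(token_metadata_lock);
    std::unique_lock<std::mutex> lk(cv_m);
    cv.wait(lk, [] { return never_signalled; });   // stuck while still holding the lock
}

// Stands in for join_cluster after "The node is already in group 0 ...".
void join_cluster() {
    std::lock_guard<std::mutex> guard(token_metadata_lock);   // never granted
    std::cout << "Performing gossip shadow round...\n";       // never printed
}

int main() {
    std::thread a(topology_state_load);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::thread b(join_cluster);
    a.join();   // hangs forever, mirroring the stuck restart
    b.join();
}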
I sent https://github.com/scylladb/scylladb/pull/21247 to help the investigation further once the problem reproduces again. (I'm trying to reproduce locally, but as usual, it's not so easy)
Idea for further debugging:
Reproduced with a tactical sleep:
diff --git a/service/storage_service.cc b/service/storage_service.cc
index cf7d75b082..af2be00da2 100644
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -2997,6 +2997,10 @@ future<> storage_service::join_cluster(sharded<db::system_distributed_keyspace>&
co_await replicate_to_all_cores(std::move(tmptr));
}
+ slogger.info("Sleep before load peer features");
+ co_await utils::get_local_injector().inject("sleep_before_load_peer_features", std::chrono::seconds{3});
+ slogger.info("DONE Sleep before load peer features");
+
// Seeds are now only used as the initial contact point nodes. If the
// loaded_endpoints are empty which means this node is a completely new
// node, we use the nodes specified in seeds as the initial contact
diff --git a/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py b/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py
index b61b0337c0..265041ba95 100644
--- a/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py
+++ b/test/topology_experimental_raft/test_crash_coordinator_before_streaming.py
@@ -99,6 +99,11 @@ async def test_kill_coordinator_during_op(manager: ManagerClient) -> None:
await manager.server_start(new_node.server_id,
expected_error="Startup failed: std::runtime_error")
await wait_new_coordinator_elected(manager, 4, time.time() + 60)
+ logger.info(f"Restarting {coordinator_host}")
+ await manager.server_stop(coordinator_host.server_id)
+ await manager.server_update_config(coordinator_host.server_id, 'error_injections_at_startup', [
+ "sleep_before_load_peer_features",
+ ])
await manager.server_restart(coordinator_host.server_id, wait_others=1)
await manager.servers_see_each_other(await manager.running_servers())
await check_token_ring_and_group0_consistency(manager)
The sleep ensures that the last reload raft topology state happens before storage_service - The node is already in group 0 and will restart in raft mode.
We're stuck in process_left_node -> remove_ip -> gossiper::force_remove_endpoint.
The node we're processing is the one that just failed to bootstrap.
@gleb-cloudius please continue this investigation. The problem is now 100% reproducible with that sleep I posted
several nodes have this crash in their logs
service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}::operator()() const at ././service/topology_coordinator.cc:1872
Nothing unexpected here:
utils::get_local_injector().inject("crash_coordinator_before_stream", [] { abort(); });
It should be exit(), not abort(). abort() just creates cores that waste disk space.
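For reference, the suggested tweak would be something like the following sketch (the exit status value here is an arbitrary choice):
-    utils::get_local_injector().inject("crash_coordinator_before_stream", [] { abort(); });
+    utils::get_local_injector().inject("crash_coordinator_before_stream", [] { exit(1); });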
The hang is in storage_service::on_remove, since at the time it is called the mode is still not set to raft on startup (because of the delay), and it tries to do something there with token metadata, which it should not touch since it is managed by raft at that point. The problem is with this upgrade code, which I cannot grasp at all. It sets the correct mode too late on a regular reboot. Why does it even try to figure something complicated out on reboot? It should be a straight read from a local table as early as possible.
@piodul
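For illustration only, a hypothetical sketch of the kind of guard being argued for; this is not the actual Scylla code and the signature is an assumption:
future<> storage_service::on_remove(gms::inet_address endpoint, gms::permit_id permit) {
    if (raft_topology_change_enabled()) {
        // Token metadata is owned by the raft topology state machine at this point,
        // so a gossip-driven removal should not touch it.
        co_return;
    }
    // ... legacy, gossip-based handling that updates token metadata ...
    co_return;
}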
It also looks like if (_db.local().get_config().load_ring_state() && !raft_topology_change_enabled()) { at the beginning of storage_service::join_cluster is always true (or rather, raft_topology_change_enabled there is always false), since we set the mode only later.
The intended purpose of the _manage_topology_change_kind_from_group0 was to determine who is in control of the _topology_change_kind_enabled flag:
* If it is set to true, then its value reflects the upgrade_state column in system.topology,
* If it is set to false, then it can be set manually by initial startup code.
_manage_topology_change_kind_from_group0 starts as false, but is set to true during startup only after we are sure that we joined group0 and we performed a read barrier (in case this is the first node, we also require that the initial state of topology was written to group0). The "manual mode" is only used if the node is joining the cluster, and is switched to the other mode after joining group 0.
I'm not 100% sure I remember the exact reason why I introduced _manage_topology_change_kind_from_group0, but I think it was about preventing the in-memory _topology_change_kind_enabled flag from being set to a wrong value. For example, when starting a single-node cluster it starts with an empty group0 state which is only initialized in raft_initialize_discovery_leader. A NULL value of upgrade_state is interpreted as "legacy topology", so the in-memory value could be temporarily wrong; in fact, there is a FIXME near the definition of _manage_topology_change_kind_from_group0 exactly about this.
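To make the ownership rule concrete, here is a small self-contained sketch of the semantics described above; the type and function names are invented and this is not the Scylla implementation:
#include <optional>

enum class topology_kind { legacy, raft };

struct topology_kind_tracker {
    // Mirrors _manage_topology_change_kind_from_group0 / _topology_change_kind_enabled.
    bool manage_from_group0 = false;
    topology_kind kind = topology_kind::legacy;

    // Manual override, honored only before group0 takes ownership of the flag.
    void set_manually(topology_kind k) {
        if (!manage_from_group0) { kind = k; }
    }

    // Called when group0 state is applied: a missing (NULL) upgrade_state reads as
    // "legacy", which is why the in-memory value can be temporarily wrong on a fresh
    // single-node cluster before raft_initialize_discovery_leader runs.
    void apply_group0_upgrade_state(std::optional<topology_kind> upgrade_state) {
        if (manage_from_group0) { kind = upgrade_state.value_or(topology_kind::legacy); }
    }
};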
But according to the bug here this is too late. If we are a part of group0, it means its apply can run before we set the variable (since we set it after starting the group), which in turn will execute the topology load code, which relies on knowing what mode we are in.
The question is: why don't we just read the current mode from the local table as early as possible on reboot, instead of trying to figure it out? Only if the mode is not there yet (as will be the case on a first boot) do we need to go through the trouble of figuring it out.
Yes, I think that setting the mode and the _manage_topology_change_kind_from_group0 flag in raft_group0::setup_group0_if_exist in case group 0 exists would work.
How? Group0 may exist but not manage topology. Maybe we can read upgrade_state from the topology table before even trying to start group0 and set the mode based on the result.
ping @gleb-cloudius what's the status of this issue?
Still waiting on @piodul to explain how it is all supposed to work. I do not see how what he proposes in https://github.com/scylladb/scylladb/issues/21114#issuecomment-2444580143 would work.
How? Group0 may exist but not manage topology. Maybe we can read upgrade_state from the topology table before even trying to start group0 and set the mode based on the result.
Yes, this should work. If bootstrap has completed, then the value of upgrade_state in the local system.topology table can be used for that purpose.
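A rough sketch of what that could look like; the helper name and the function itself are assumptions rather than the actual Scylla API:
// Hypothetical early initialization on reboot: read upgrade_state from the local
// system.topology table before group0 starts applying commands, so that
// topology_state_load already runs with the right mode.
future<> storage_service::init_topology_change_kind_early() {
    // Assumed helper: returns nullopt on a first boot, when system.topology is empty.
    auto upgrade_state = co_await load_local_topology_upgrade_state();
    if (upgrade_state) {
        set_topology_change_kind(upgrade_state_to_topology_op_kind(*upgrade_state));
        _manage_topology_change_kind_from_group0 = true;
        co_return;
    }
    // Otherwise fall back to the existing path that figures the mode out while joining.
}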
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/12400/
several nodes have this crash in their logs
decoded