scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com

gossiper failure detector: marking crashed node as DOWN races with marking it as UP when it restarts; ends up being DOWN #18257

Open kbr-scylla opened 4 months ago

kbr-scylla commented 4 months ago

Test by @aleksbykov https://github.com/kbr-scylla/scylladb/commit/563af5fa3c16d0a0be40426c33454d83cc1c22ea

Failed on this phase:

        # kill coordinator during replace
        logger.debug("Kill coordinator during replace")
        nodes = await manager.running_servers()
        coordinator_host = await get_coordinator_host(manager)
        other_nodes = [srv for srv in nodes if srv.server_id != coordinator_host.server_id]
        node_to_replace_srv_id = other_nodes[-1].server_id
        await manager.server_stop_gracefully(node_to_replace_srv_id)
        await manager.api.enable_injection(coordinator_host.ip_addr, "crash_coordinator_before_stream", one_shot=True)
        replace_cfg = ReplaceConfig(replaced_id = node_to_replace_srv_id, reuse_ip_addr = False, use_host_id = True)
        new_node = await manager.server_add(start=False, replace_cfg=replace_cfg)
        await manager.server_start(new_node.server_id, expected_error="Replace failed. See earlier errors")
        await wait_new_coordinator_elected(manager, 5, time.time() + 60)
        logger.debug("Start old coordinator node")
        await manager.server_restart(coordinator_host.server_id)
        await manager.servers_see_each_other(await manager.running_servers())
        logger.debug("Replaced node is already non-voter and will be banned after restart. Remove it")
        coordinator_host = await get_coordinator_host(manager)
>       await manager.remove_node(coordinator_host.server_id, node_to_replace_srv_id)

Uploaded logs from last failure:

scylla-3131.log scylla-3132.log scylla-3133.log scylla-3136.log scylla-3137.log scylla-3154.log scylla-3156.log scylla-3168.log topology_experimental_raft.test_crash_coordinator_before_streaming.1.log

In short, this is what happened:

  1. we shut down a node
  2. we start replacing it
  3. error injection kills the topology coordinator in the middle, causing the replace to fail
  4. we wait for a new coordinator to get elected
  5. we restart the previously killed coordinator node
  6. then we call removenode for the node that was shut down in the first step
  7. removenode fails, because some nodes think that the node that was killed and then restarted is DOWN.

Even though the test contains

        await manager.servers_see_each_other(await manager.running_servers())

before doing removenode.

Apparently, there is some race inside the gossiper failure detection code:

In the above run we saw it on scylla-3132

ERROR 2024-04-15 18:03:23,747 [shard 0:main] rpc - client 127.18.81.45:7000: client connection dropped: recv: Connection reset by peer
ERROR 2024-04-15 18:03:23,748 [shard 0:main] rpc - client 127.18.81.45:7000: client connection dropped: recv: Connection reset by peer
WARN  2024-04-15 18:03:23,765 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.18.81.45, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-04-15 18:03:23,765 [shard 1:strm] gossip - failure_detector_loop: Send echo to node 127.18.81.45, status = failed: seastar::rpc::closed_error (connection is closed)
...
WARN  2024-04-15 18:03:25,765 [shard 1:strm] gossip - failure_detector_loop: Send echo to node 127.18.81.45, status = failed: seastar::rpc::closed_error (connection is closed)
...
INFO  2024-04-15 18:03:27,662 [shard 0: gms] gossip - Node 127.18.81.45 has restarted, now UP, status = NORMAL
...
INFO  2024-04-15 18:03:27,766 [shard 1:strm] gossip - failure_detector_loop: Mark node 127.18.81.45 as DOWN
INFO  2024-04-15 18:03:27,766 [shard 0:strm] gossip - InetAddress e44f5621-6925-4b8c-9976-de937c70961d/127.18.81.45 is now DOWN, status = NORMAL

The issue doesn't look critical.
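
As a possible test-side hardening (a sketch only, with assumptions about the framework): servers_see_each_other presumably checks a single snapshot of mutual reachability, so a stricter precondition before removenode would be to poll every node's gossip view until the restarted node is reported live everywhere. get_live_endpoints below is a hypothetical helper for whatever liveness view a node exposes; the real test framework API may differ.

    import asyncio
    import time

    async def wait_all_see_live(manager, observer_ips, expected_ip, deadline):
        """Poll every observer node until it reports expected_ip as live in gossip."""
        while True:
            # get_live_endpoints() is a hypothetical helper wrapping whatever the
            # node exposes for its gossip liveness view; adjust to the real API.
            views = [await get_live_endpoints(manager, ip) for ip in observer_ips]
            if all(expected_ip in view for view in views):
                return
            if time.time() > deadline:
                raise TimeoutError(f"{expected_ip} not seen as live by all nodes")
            await asyncio.sleep(1)

This would not fix the gossiper race itself; it only makes the test's precondition explicit, so a stale DOWN fails the wait instead of the later removenode.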

kbr-scylla commented 4 months ago

Possibly related: 2d9e78b09af14c3a2acbd3af7c8910e514895d36 (cc @piodul )

piodul commented 4 months ago

> Possibly related: 2d9e78b (cc @piodul )

I don't think the commit causes this issue. This looks like a different kind of race, but in the same area of the gossiper (node-specific FD loop marking as down vs. gossiper exchange marking as up).
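
To make that concrete, here is a toy asyncio model of the interleaving (an illustration only, not ScyllaDB code): a per-node echo loop that applies its DOWN verdict one interval after the echo failed, and a gossip round that marks the node UP in between, so the stale DOWN lands last and wins.

    import asyncio

    state = {"alive": True}   # shared per-node liveness flag, no ordering between writers
    events = []

    def mark(alive, who):
        state["alive"] = alive
        events.append(f"{who}: marked {'UP' if alive else 'DOWN'}")

    async def failure_detector_loop():
        # The echo sent to the old (now crashed) instance fails, but the DOWN
        # marking is only applied on the next loop iteration, one interval later.
        await asyncio.sleep(1.0)          # echo interval
        echo_succeeded = False            # stale failure from the dead connection
        if not echo_succeeded:
            mark(False, "fd_loop")

    async def gossip_round():
        # Meanwhile a gossip exchange has already learned about the restart.
        await asyncio.sleep(0.5)
        mark(True, "gossip")

    async def main():
        await asyncio.gather(failure_detector_loop(), gossip_round())
        print(events, "-> final alive =", state["alive"])
        # ['gossip: marked UP', 'fd_loop: marked DOWN'] -> final alive = False

    asyncio.run(main())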

kbr-scylla commented 4 months ago

In any case, with the way the test is written, there could be another race, and that one inherently cannot be fixed.

Namely, the test may start removenode (and enter the repair stage) before the killed node is even marked as DOWN.

Then the killed node is marked as DOWN in the middle of the repair. And even if it gets marked as UP later (so the markings happen in the correct order), the repair is interrupted.

So while there does seem to be some kind of race in the gossiper, causing DOWN to happen after UP (i.e. the notifications arrive in reversed order), this scenario will not work in general.
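
If that is accepted as inherent, the test could tolerate it rather than rely on precise ordering, for example by retrying removenode after a transient failure. A sketch under the assumption that manager.remove_node raises on failure (the exact exception type and the right back-off depend on the test framework):

    import asyncio

    async def remove_node_with_retry(manager, initiator_id, target_id, attempts=3):
        # Retry wrapper for the remove_node call shown in the test above.
        # Assumes remove_node raises on failure; adjust the exception type
        # and the back-off to the real framework.
        last_err = None
        for attempt in range(attempts):
            try:
                await manager.remove_node(initiator_id, target_id)
                return
            except Exception as err:
                last_err = err
                # Give the gossiper time to converge on the restarted node's
                # liveness before trying again.
                await asyncio.sleep(5)
        raise RuntimeError(f"removenode failed after {attempts} attempts") from last_err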

kostja commented 4 months ago

@gleb-cloudius let's store up/down states in raft, please!

gleb-cloudius commented 4 months ago

> @gleb-cloudius let's store up/down states in raft, please!

So your idea to fix a wrong failure detector decision is to make it global?

kostja commented 4 months ago

It's not wrong, it's arriving out of order; the easiest way to fix that is by linearizing the state changes with Raft.

gleb-cloudius commented 4 months ago

> It's not wrong, it's arriving out of order; the easiest way to fix that is by linearizing the state changes with Raft.

What is arriving out of order? The local failure detector decision? Take a lock. You do not need raft to solve a race between two local fibers.
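
Continuing the toy model from above (again an illustration, not the gossiper's actual code), serializing the two local fibers with a lock and discarding a DOWN verdict that is older than the latest UP would be enough. The timestamp check is an addition for the sake of the sketch, not something specified in this thread.

    import asyncio
    import time

    class LivenessState:
        # Toy per-node liveness state shared by the two local fibers.
        def __init__(self):
            self.lock = asyncio.Lock()
            self.alive = True
            self.last_up_at = time.monotonic()

        async def mark_up(self):
            async with self.lock:
                self.alive = True
                self.last_up_at = time.monotonic()

        async def mark_down(self, failure_observed_at):
            async with self.lock:
                # The echo failure predates the latest UP (the node restarted
                # in between), so this DOWN verdict is stale: drop it.
                if failure_observed_at < self.last_up_at:
                    return
                self.alive = False

With this, the failure detector fiber in the earlier sketch would record when it observed the echo failure and pass that to mark_down, so the UP recorded later by the gossip round wins regardless of which fiber runs last.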