scylladb / scylla-cluster-tests

Tests for Scylla Clusters

check_node_status_in_gossip_and_nodetool_status generates unclear error on non-graceful shutdown #6848

Open bhalevy opened 10 months ago

bhalevy commented 10 months ago

When a node goes down non-gracefully, e.g. if an instance is terminated, or the server process is killed and it takes long enough until it comes back up, other nodes will report it as DN in nodetool status, but its STATUS in nodetool gossipinfo will be reported as NORMAL. This is the expected gossip behavior. The reason is that in a graceful shutdown, the node going down sends a gossip shutdown RPC to the other nodes to let them know it is shutting down, so they change its gossip STATUS to shutdown,true; otherwise it remains NORMAL.
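A rough sketch of how the two views can disagree after a non-graceful stop — the parsing helpers below are illustrative only, not SCT's actual code, and the output formats are simplified:

```python
# Illustrative only: compare `nodetool status` states with gossip STATUS values.

def parse_nodetool_status(output: str) -> dict[str, str]:
    """Map node address -> two-letter state ('UN', 'DN', ...) from `nodetool status`."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("UN", "DN", "UJ", "DJ", "UL", "DL", "UM", "DM"):
            states[parts[1]] = parts[0]
    return states

def parse_gossip_status(output: str) -> dict[str, str]:
    """Map node address -> gossip STATUS ('NORMAL', 'shutdown,true', ...) from `nodetool gossipinfo`."""
    statuses, current = {}, None
    for line in output.splitlines():
        if line.startswith("/"):
            current = line.strip().lstrip("/").split()[0]
        elif current and line.strip().startswith("STATUS:"):
            # real lines look roughly like "STATUS:<version>:NORMAL,<token>"
            statuses[current] = line.strip().split(":", 2)[-1]
    return statuses

# After a node's instance is killed (no shutdown RPC was ever sent), other nodes show:
#   parse_nodetool_status(...)[ip]  -> "DN"
#   parse_gossip_status(...)[ip]    -> "NORMAL,<token>"
```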

Seen in https://argus.scylladb.com/test/0a6e192a-1bdb-4b49-b6c2-e796ac5d1c99/runs?additionalRuns%5B%5D=823076dd-b8a2-46c4-b8a9-3e2c2faa10cc:

2023-11-09 17:37:51.324: (ClusterHealthValidatorEvent Severity.ERROR) period_type=one-time event_id=de6c8e61-36d9-4615-9f87-eab30182dea2: type=NodeStatus node=longevity-large-partitions-200k-pks-db-node-823076dd-1 error=Current node Node longevity-large-partitions-200k-pks-db-node-823076dd-1 [34.240.27.150 | 10.4.6.234] (seed: True). Wrong node status. Node Node longevity-large-partitions-200k-pks-db-node-823076dd-5 [34.254.94.8 | 10.4.6.250] (seed: True) status in nodetool.status is DN, but status in gossip NORMAL

In this case node-5 disappeared unexpectedly, and the error message is unclear to anyone who doesn't understand these low-level details.

Also, if we have a "kill" nemesis where nodes are taken down ungracefully, this error will be triggered, even though it should be expected.

I think that if the test has a model that can tell what the expected status in nodetool status is for each node, that would be enough for a check_node_status function. Although nodetool gossipinfo is an official interface, I'm afraid it's too low-level and too subject to change to be relied upon in system-level tests such as SCT.
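A minimal sketch of what such a model-based check could look like — all names here are hypothetical, not existing SCT code:

```python
from enum import Enum

class ExpectedState(Enum):
    """What the test expects `nodetool status` to report for a node."""
    UP = "UN"
    DOWN = "DN"       # e.g. the nemesis killed the node and hasn't restarted it yet
    GONE = "absent"   # decommissioned/removed: should not appear in the output at all

def check_node_status(expected: dict[str, ExpectedState],
                      nodetool_states: dict[str, str]) -> list[str]:
    """Compare the model against `nodetool status` and return human-readable errors."""
    errors = []
    for address, want in expected.items():
        got = nodetool_states.get(address)
        if want is ExpectedState.GONE:
            if got is not None:
                errors.append(f"{address}: expected to be gone, nodetool reports {got}")
        elif got != want.value:
            errors.append(f"{address}: expected {want.value}, nodetool reports {got or 'missing'}")
    return errors
```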

If and when we move all topology decisions to raft, we should test the raft status instead of the gossip status, since it would be more reliable and can be exposed as an official interface to the user.

Cc @asias @kostja @kbr-scylla @tgrabiec

mykaul commented 9 months ago

@fruch - who should own this item?

fruch commented 9 months ago

I don't know exactly what is expected here

Not to use gossipinfo? To check raft status?

@temichus @aleksbykov, I think checking raft group0 was already introduced, right ?

Also, since @bhalevy tagged people here, I would expect some of them to care and decide what is needed.

aleksbykov commented 9 months ago

@temichus @aleksbykov, I think checking raft group0 was already introduced, right ?

Yes, SCT already checks consistency of raft group0 and token ring members: it only checks that the group0 members and the token ring members are the same. It is one of the steps of cluster_health_checker.
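For illustration, the shape of that consistency check might be something like the following — the two getter callables are placeholders for however SCT actually fetches the sets, not real SCT functions:

```python
def check_group0_vs_token_ring(get_group0_member_ids, get_token_ring_host_ids) -> list[str]:
    """Report host IDs that appear in raft group0 but not in the token ring, or vice versa."""
    group0 = set(get_group0_member_ids())
    token_ring = set(get_token_ring_host_ids())
    errors = []
    for host_id in sorted(group0 - token_ring):
        errors.append(f"host {host_id} is a group0 member but not a token ring member")
    for host_id in sorted(token_ring - group0):
        errors.append(f"host {host_id} is a token ring member but not a group0 member")
    return errors
```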

kbr-scylla commented 9 months ago

I think @bhalevy you're confusing STATUS with UP/DOWN failure detection state.

DN in nodetool status means "DOWN NORMAL". And IIUC SHUTDOWN is basically translated to NORMAL.

IIRC, the RPC sent by a gracefully shutting down node to the other nodes causes those nodes to mark it as down in their failure detector, so the UN node becomes a DN node from the POV of these other nodes.

All that said, I don't understand the message in this cluster health check. It basically says: "status in nodetool is NORMAL, but in gossipinfo it's NORMAL", which doesn't make sense. Yes, it mentions that in nodetool the node is DN, but it doesn't mention UP/DOWN in gossipinfo. I guess that "NORMAL" in this context should translate to "UN", and then the proper message really should be "node is marked as DOWN in nodetool, but UP in gossipinfo", or if we want to add the extra info about status: "node is DN in nodetool but UN in gossipinfo".
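Something like the wording below would make the mismatch explicit (a sketch; the names are illustrative):

```python
def status_mismatch_message(node: str, nodetool_state: str, alive_in_gossip: bool) -> str:
    """Spell out both sides of the mismatch instead of mixing STATUS with UP/DOWN."""
    gossip_side = "UP" if alive_in_gossip else "DOWN"
    return (f"Node {node} is {nodetool_state} in nodetool status, "
            f"but {gossip_side} according to gossipinfo")
```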

@fruch @aleksbykov the group0 consistency check has nothing to do with this. What @bhalevy probably had in mind is that we are moving STATUS from gossip to raft-based topology. This is only functional under the experimental consistent-topology-changes feature for now.

But it won't fix the issue here, because the discrepancy is in the failure detection state (UP/DOWN), not STATUS.

fruch commented 9 months ago

One more addition about how SCT performs these checks: it doesn't run the check once and raise a failure; it retries it multiple times. Regardless of what happened in a nemesis, the expectation when we check in between is that the cluster is back in a fully working state; if it's not, there is an issue to investigate.
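Roughly, the retry behaviour described here looks like this (an illustrative sketch, not the actual SCT implementation):

```python
import time

def wait_for_healthy_cluster(health_check, attempts: int = 10, delay: float = 30.0) -> list[str]:
    """Retry `health_check` (returns a list of error strings) until it passes or attempts run out."""
    errors: list[str] = []
    for _ in range(attempts):
        errors = health_check()
        if not errors:
            return []        # cluster is back to a fully working state
        time.sleep(delay)
    return errors            # persistent errors -> raise an event and investigate
```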

We aren't going to keep a model of what the status of the cluster should be just for the sake of having clearer messages here. This check has served us well enough so far.

If the wording of one message or another can be improved, that's something anyone can contribute.