Open bhalevy opened 10 months ago
@fruch - who should own this item?
I don't know exactly what is expected here. Not to use gossipinfo? To check raft status?
@temichus @aleksbykov, I think checking raft group0 was already introduced, right?
Also, since @bhalevy has tagged the people at the end, I would expect some of them to care and decide what is needed.
> @temichus @aleksbykov, I think checking raft group0 was already introduced, right?
Yes, SCT already checks consistency of raft group0 and token ring members. It only checks that group0 members and token ring members are the same. It is one of the steps of cluster_health_checker.
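The membership-consistency step described above can be sketched as a simple set comparison. This is an illustrative sketch, not the actual cluster_health_checker code; `check_group0_tokenring_consistency` and its inputs are hypothetical stand-ins for however SCT obtains each membership view.

```python
def check_group0_tokenring_consistency(group0_members: set[str],
                                       token_ring_members: set[str]) -> list[str]:
    """Compare the two membership views; an empty list means they agree."""
    errors = []
    # Members known to raft group0 but missing from the token ring.
    for host_id in sorted(group0_members - token_ring_members):
        errors.append(f"{host_id} is in raft group0 but not in the token ring")
    # Members owning tokens but absent from raft group0.
    for host_id in sorted(token_ring_members - group0_members):
        errors.append(f"{host_id} is in the token ring but not in raft group0")
    return errors
```

Note that this only checks membership, not liveness, which is why it is orthogonal to the UP/DOWN discrepancy discussed below.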
I think @bhalevy you're confusing STATUS with UP/DOWN failure detection state.
`DN` in `nodetool status` means "DOWN NORMAL". And IIUC `SHUTDOWN` is basically translated to `NORMAL`.
IIRC, the RPC sent by gracefully shutting down nodes causes the other nodes to mark the sender as down, so the UN node becomes a DN node in the POV of these other nodes.
All that said, I don't understand the message in this cluster health check. It basically says: "status in nodetool is NORMAL, but in gossipinfo it's NORMAL". That doesn't make sense. Yes, it mentions that in nodetool it's DN, but it doesn't mention the UP/DOWN state in gossipinfo. I guess that "NORMAL" in this context should translate to "UN", and then the proper message really should be "node is marked as DOWN in nodetool, but UP in gossipinfo", or, if we want to add the extra info about status: "node is DN in nodetool but UN in gossipinfo".
@fruch @aleksbykov the group0 consistency check has nothing to do with this. What @bhalevy probably had in mind was that we are moving STATUS from gossip to raft-based topology. This only functions under the experimental consistent-topology-changes feature for now.
But it won't fix the issue here, because the discrepancy is in the failure detection state (UP/DOWN), not STATUS.
one more note on how SCT does those checks: it doesn't run once and raise a failure, it retries multiple times, and regardless of what happened in a nemesis, the expectation when we check between them is that the cluster is back in a fully working state. If it's not, there's an issue to investigate.
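The retry-until-converged behavior described above can be sketched as follows. This is a minimal illustration, not SCT's implementation; `check` is any callable returning a list of error strings, and the attempt count and delay are arbitrary.

```python
import time

def retrying_health_check(check, attempts: int = 10, delay: float = 0.0) -> None:
    """Run `check` repeatedly; only raise if the cluster never converges."""
    errors: list[str] = []
    for _ in range(attempts):
        errors = check()
        if not errors:
            return  # cluster is back in a fully working state
        time.sleep(delay)  # give the cluster time to recover before retrying
    raise AssertionError("cluster health check failed: " + "; ".join(errors))
```

The point is that a transient DN right after a nemesis is tolerated; only a discrepancy that persists across all attempts is treated as an issue to investigate.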
we aren't going to keep a model of what the status of the cluster should be just for the sake of having clearer messages here. This check has served us well enough so far.
If the wording of one message or another can be improved, that's something anyone can contribute.
When a node goes down non-gracefully, e.g. if an instance is terminated, or the server process is terminated and it takes long enough until it comes back up, other nodes will report it as `DN` in `nodetool status`, but its `STATUS` in `nodetool gossipinfo` will be reported as `NORMAL`. This is the expected gossip behavior. The reason is that in a graceful shutdown, the node going down sends a gossip shutdown RPC to the other nodes to let them know it is shutting down, so they change its gossip `STATUS` to `shutdown,true`, but otherwise it remains as `NORMAL`.

Seen in https://argus.scylladb.com/test/0a6e192a-1bdb-4b49-b6c2-e796ac5d1c99/runs?additionalRuns%5B%5D=823076dd-b8a2-46c4-b8a9-3e2c2faa10cc:
In this case node-5 disappeared unexpectedly, and the error message is unclear without understanding the low-level details.
Also, if we have a "kill" nemesis where nodes are taken down ungracefully, this error will be triggered, yet it should be expected.
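Per the explanation above, the gossip `STATUS` application state is the only signal that distinguishes the two cases: after a graceful shutdown it reads `shutdown,true`, while after a hard kill it stays `NORMAL,...`. A hypothetical sketch of classifying a raw STATUS value (the function name and kill-nemesis handling are illustrative assumptions, not SCT code):

```python
def shutdown_was_graceful(gossip_status_value: str) -> bool:
    """Classify a raw gossipinfo STATUS value such as 'shutdown,true'
    or 'NORMAL,-1292...' by its first comma-separated field."""
    return gossip_status_value.split(",", 1)[0].lower() == "shutdown"
```

A check aware of this could then treat a `DN` node whose STATUS is still `NORMAL` as expected fallout of a kill nemesis rather than a reportable discrepancy.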
I think that if the test has a model that can tell what the expected status in `nodetool status` is for each node, it would be enough for a `check_node_status` function. Although `nodetool gossipinfo` is an official interface, I'm afraid it's too low-level and subject to change to count on it for system-level tests such as SCT.

If and when we move all topology decisions to raft, we should be testing the raft status instead of the gossip status, since it'd be more reliable and can be exposed as an official interface to the user.
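The model-based check suggested above could look like the following sketch: the test maintains its own expectation per node (updated by nemeses, e.g. flipping a node to "DN" after a kill) and diffs it against what `nodetool status` reports. All names here are illustrative, not SCT or Scylla API.

```python
def check_node_status(expected: dict[str, str],
                      observed: dict[str, str]) -> list[str]:
    """Compare the test's model of per-node states (e.g. {'node-5': 'DN'})
    against the states parsed from nodetool status; empty list == healthy."""
    errors = []
    for node, want in expected.items():
        got = observed.get(node, "missing")  # node absent from output entirely
        if got != want:
            errors.append(f"{node}: expected {want} in nodetool status, got {got}")
    return errors
```

With such a model, a node killed ungracefully is simply *expected* to be `DN`, so the check needs no knowledge of gossipinfo internals at all.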
Cc @asias @kostja @kbr-scylla @tgrabiec