valkey-io / valkey

A flexible distributed key-value datastore that is optimized for caching and other realtime workloads.
https://valkey.io
Other
17.36k stars 656 forks source link

[cluster-bus] Send a MEET packet to a node if there is no inbound link #1307

Open pieturin opened 1 week ago

pieturin commented 1 week ago

In some cases, when meeting a new node, if the handshake times out, we can end up with an inconsistent view of the cluster where the new node knows about all the nodes in the cluster, but the cluster does not know about this new node (or vice versa). To detect this inconsistency, we now check if a node has an outbound link but no inbound link, in this case it probably means this node does not know us. In this case we (re-)send a MEET packet to this node to do a new handshake with it.

This fixes the bug described in #1251.

codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.67%. Comparing base (32f7541) to head (6c67d41).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## unstable #1307 +/- ## ============================================ - Coverage 70.69% 70.67% -0.02% ============================================ Files 115 115 Lines 63153 63163 +10 ============================================ Hits 44643 44643 - Misses 18510 18520 +10 ``` | [Files with missing lines](https://app.codecov.io/gh/valkey-io/valkey/pull/1307?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=valkey-io) | Coverage Δ | | |---|---|---| | [src/cluster\_legacy.c](https://app.codecov.io/gh/valkey-io/valkey/pull/1307?src=pr&el=tree&filepath=src%2Fcluster_legacy.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=valkey-io#diff-c3JjL2NsdXN0ZXJfbGVnYWN5LmM=) | `86.20% <100.00%> (+0.01%)` | :arrow_up: | ... and [13 files with indirect coverage changes](https://app.codecov.io/gh/valkey-io/valkey/pull/1307/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=valkey-io)
pieturin commented 1 week ago

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-open an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

hpatro commented 1 week ago

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-opened an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

CLUSTER MEET is an admin operation but I guess we are fine with the case of reinitiating it if the operation wasn't successful in first place and retry it.

madolson commented 21 hours ago

only clear the CLUSTER_NODE_MEET flag when myself receive a "ack" (not the plain PONG but something with a strong ack, ack that sender has already meet myself?

Do you mean by like adding a new flag? I think the concern is we could still end up in the inverse state, where the the node that received the "strong" ack will put the other node online but then might go offline.

My original thought was that as long as one node believes the other is part of the cluster, is should try to have the other node join. It's sort of like an "enhanced" version of how we built up the mesh when two disjoin clusters meet each other.