
[NEW] Deterministic CLUSTER SHARDS Command #114

Open barshaul opened 4 months ago

barshaul commented 4 months ago

The problem/use-case that the feature addresses

Currently, some clients rely on a single-node perspective to obtain cluster topology, which may not accurately reflect the entire cluster's state. Ideally, clients should query multiple nodes and compare their topology views to determine the most consistent representation. To facilitate efficient comparison on the client side, if Valkey could offer a deterministic view, where nodes with identical views return identical outputs, clients could easily hash and compare these outputs. At the moment, certain fields in the CLUSTER SHARDS output, such as "replication-offset", may differ between nodes even when they share the same view, so each client must parse the views and filter them itself. Furthermore, in the current output of CLUSTER SHARDS (or CLUSTER SLOTS), nodes with identical views may present the replicas of specific slots in differing orders, which forces clients to parse and sort the outputs before views can be compared accurately.
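
To make the comparison concrete, here is a minimal sketch (not tied to any particular client library) of the hash-and-compare step a client would like to perform; today it only works after the client has itself stripped volatile fields such as "replication-offset" and sorted the replicas:

```python
import hashlib

def view_digest(normalized_reply: bytes) -> str:
    """Hash one node's (already normalized) CLUSTER SHARDS reply."""
    return hashlib.sha256(normalized_reply).hexdigest()

def views_agree(replies_by_node: dict) -> bool:
    """True if every queried node returned a byte-identical topology view."""
    digests = {view_digest(reply) for reply in replies_by_node.values()}
    return len(digests) == 1
```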

Description of the feature

In order to achieve deterministic output, 2 changes are suggested:

  1. Adding a filter option to the CLUSTER SHARDS command. This enhancement would enable clients to selectively filter out fields that are irrelevant for their specific use cases, such as non-deterministic fields, nodes in a 'fail' state that don't matter for the client's functionality, or any other fields the client doesn't need. It would enable hash-based comparison of views without parsing the output, while also potentially reducing the size of each node's output.

  2. Sorting the replica order in the output.

madolson commented 4 months ago

Adding a filter option to the CLUSTER SHARDS command. This enhancement would enable clients to selectively filter out fields that are irrelevant for their specific use cases, such as non-deterministic fields, nodes in a 'fail' state that don't matter for the client's functionality, or any other fields the client doesn't need. It would enable hash-based comparison of views without parsing the output, while also potentially reducing the size of each node's output.

I'm generally against a lot of complexity like what we would need for filtering. So I would prefer we just add CLUSTER SHARDS DETERMINISTIC (I don't like the word, maybe canonical?) and allow clients to optionally call that and we uphold that type of contract on the server.

Sorting the replica order in the output.

I agree with this.

daniel-house commented 3 months ago

I like the word deterministic. If the output is not sorted it can't be either deterministic or canonical, so I like it being sorted.

Clearly you feel that the fields are not presented in a way that is easy for the client to filter. Perhaps deterministic or canonical, or one other keyword, could change the format to something that would be trivial to filter - maybe even JSON.

madolson commented 3 months ago

Clearly you feel that the fields are not presented in a way that is easy for the client to filter. Perhaps deterministic or canonical, or one other keyword, could change the format to something that would be trivial to filter - maybe even JSON.

Not sure the format is the concern. The last time I discussed this with Bar (who works at AWS), the point was that it was simpler and faster to do it without parsing. Maybe there is more to it, though.

PingXie commented 3 months ago

The way I see it, the problem with CLUSTER SHARDS is that it mixes "config" information, the cluster topology in this case, with "state" information, such as replication offset and health, in its output. So maybe a better word than "deterministic" is "topology", which better reflects the requirement here. "config" would work for me too.

"deterministic" on the other hand is a bit subjective and leaves room for different interpretation, IMO.

zuiderkwast commented 3 months ago

@madolson Why is this a major decision? Sorry, I wasn't reading properly.

Sorting the nodes can't be a breaking change (like @hpatro did for CLUSTER SLOTS in #265).

Regarding filtering arguments, I'm not so sure... For hashing the output without parsing, I think that's one of the reasons CLUSTER NODES returns a string rather than RESP-structured data.

For a structured RESP reply like this, a client needs to parse the RESP reply just to know where the end of the reply is; you'd need a very special parser to find the length of the response without parsing it. So I think it'd be fine for clients to get the reply, remove some fields (which we can document), and then compute a hash of the rest.
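
As a rough sketch of that client-side approach (assuming the client library has already decoded the reply into maps/arrays, and assuming a documented set of volatile fields such as "replication-offset" and "health"):

```python
import hashlib
import json

# Assumed set of fields to strip; the actual list would need to be documented.
VOLATILE_NODE_FIELDS = {"replication-offset", "health"}

def canonical_digest(shards: list) -> str:
    """Strip volatile fields, fix the node order, then hash a canonical serialization."""
    normalized = []
    for shard in shards:
        nodes = [
            {k: v for k, v in node.items() if k not in VOLATILE_NODE_FIELDS}
            for node in shard.get("nodes", [])
        ]
        nodes.sort(key=lambda n: n.get("id", ""))        # deterministic replica/node order
        normalized.append({"slots": shard.get("slots", []), "nodes": nodes})
    normalized.sort(key=lambda s: str(s["slots"]))        # deterministic shard order
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```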

PingXie commented 3 months ago

So I think it'd be fine for clients to get the reply, remove some fields (which we can document), and then compute a hash of the rest.

I think the proposal is to have the server remove volatile fields, as opposed to clients doing the filtering.

zuiderkwast commented 3 months ago

@PingXie I know that's the suggestion, but adding filter arguments to the command is a larger change than documenting what clients need to do if they want to compute this kind of hash.

PingXie commented 3 months ago

I'd think a binary filter would be acceptable, something like "topology" or "all". There is a straightforward story there, IMO. On fine-grained filtering, agreed.

daniel-house commented 3 months ago

The problem seems to be

Currently, some clients rely on a single-node perspective to obtain cluster topology, which may not accurately reflect the entire cluster's state.

The rest of the initial post appears to be offering ways to make it easier for a client to get an accurate picture of the entire cluster's state.

I think this problem may be theoretically unsolvable. If so, let us seek a sufficiently useful approximate picture.

Suppose we had some report from each node that was deterministically comparable, as requested by Bar. (Note: there is nothing subjective about the word deterministic. As Bar wrote: nodes with identical views return identical outputs. However, the choice of deterministic output is subjective as it depends on how we will use it.) Would it be sufficient to find a quorum that is reporting exactly the same state? How big would the quorum have to be? Can we determine if the cluster's topology is in flux due to adding or removing nodes, slot-migration or fail-over, so that we might need to try again later?

What is the ultimate use case? Why do we need an accurate picture of the entire cluster's state? Are we presenting a dashboard to the admins? Are we trying to update the client's slot-map for following moved keys? Are we building a replacement for Sentinel? Do we want to give advice for re-sharding? All of the above?

Would the report be more useful for more use cases if the server did some filtering? Would it be more useful if it were trivial to parse using easily available libraries (e.g., JSON)?

Should it be one or more new CLUSTER subcommands such as STATE, TOPOLOGY, or HEALTH?

daniel-house commented 3 months ago

There was some discussion of cluster V2 in https://github.com/valkey-io/valkey/issues/58 .

Could someone link information about cluster V2 into this issue? I don't know anything about it except what I gathered from issue 58, but it seems to me that cluster V2 might provide a better solution for the use case that motivates this discussion of CLUSTER SHARDS.

PingXie commented 3 months ago

What is the ultimate use case?

Cluster v1 is eventually consistent (if ever). Whenever there is a topology change, such as adding replicas, scaling out/re-sharding, or performing rolling upgrades, there needs to be a way for the client to detect that the cluster topology has quiesced before it can safely update its view of the slot->node mapping. The signal that @barshaul is looking for here is that "a majority of nodes agree on the cluster topology", as reported by CLUSTER SHARDS.

Why do we need an accurate picture of the entire cluster's state?

Request routing is performed by the client based on the key hash: the client computes the CRC16 of the key in question to figure out the (logical) slot it belongs to. However, to figure out which node hosts that slot, the client relies on the slot->node mapping, which it obtains from one of the cluster topology query commands such as CLUSTER SHARDS.
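
For reference, a small sketch of that routing computation (CRC16-XMODEM of the key, or of its hash-tag section, modulo the 16384 slots); the resulting slot is then looked up in the slot->node map built from CLUSTER SHARDS:

```python
def crc16_xmodem(data: bytes) -> int:
    # CRC16-CCITT (XMODEM) as used for cluster key hashing: poly 0x1021, init 0.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    # Honor hash tags: if the key contains a non-empty {...} section,
    # only that section is hashed so related keys land in the same slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1 : end]
    return crc16_xmodem(key) % 16384
```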

Are we presenting a dashboard to the admins?

Not the primary use case

Are we trying to update the client's slot-map for following moved keys?

It is a type of cluster topology change.

Are we building a replacement for Sentinel? Do we want to give advice for re-sharding? All of the above?

Orthogonal to sentinel. This is all about clusters.

Would the report be more useful for more use cases if the server did some filtering?

Server-side filtering vs client-side filtering boils down to implement-it-once (on the server) vs implement-it-many-times (once in every interested client).

Would it be more useful if it were trivial to parse using easily available libraries (e.g., JSON)?

That would be a breaking change.

daniel-house commented 3 months ago

I would never suggest replacing what we have today with JSON because that would indeed be a breaking change. Would allowing the client to request JSON instead of the current format be better than having the client specify a filter?

Please help me continue to pin down the requirements. At this point they appear to be to provide a way for the client to 1) detect that the cluster topology is not in flux, or 2) determine that a majority (quorum = greater than 50%) of nodes agree on the cluster topology, or 3) determine the most consistent representation of the current topology of the cluster, or 4) all of the above.

In all of these cases I think we can do much better by providing a new CLUSTER subcommand.

For example, suppose the goal is (2) determine that we have a quorum (the percentage might be a variable). We could make it much easier to wait for a quorum by associating a unique, non-decreasing ID with each new topology. Then the client only needs to poll until a quorum reports the same maximal ID. Note: I have an intense dislike for polling, and would look for a better solution.
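
Purely as an illustration of that polling idea, and assuming a hypothetical per-topology ID that does not exist today (fetch_topology_id below is a stand-in for whatever command would expose it):

```python
import time

def wait_for_quorum(nodes, fetch_topology_id, quorum_ratio=0.5, interval=0.5, timeout=30.0):
    """Poll until more than quorum_ratio of the nodes report the same, maximal topology ID."""
    deadline = time.monotonic() + timeout
    needed = int(len(nodes) * quorum_ratio) + 1       # strictly more than the ratio
    while time.monotonic() < deadline:
        ids = [fetch_topology_id(node) for node in nodes]   # hypothetical per-node query
        newest = max(ids)
        if ids.count(newest) >= needed:
            return newest                             # quorum agrees on the newest view
        time.sleep(interval)                          # topology still in flux; retry
    raise TimeoutError("no quorum on a single topology ID before the timeout")
```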

I can think of lots of possible solutions, but I'd rather wait until we have a solid agreement on the goal.

PingXie commented 3 months ago

JSON or not is not at the core of this problem. The output of the existing CLUSTER SHARDS is already conceptually JSON. The content is what is actually being discussed in this thread.

associating a unique, non-decreasing ID with each new topology

Can you elaborate on how this ID would be maintained? There is the epoch, which has this monotonically increasing property, but it is a per-shard concept and it is not reported by CLUSTER SHARDS.

I can think of lots of possible solutions, but I'd rather wait until we have a solid agreement on the goal.

Yes, and I would say (2) and (3) are my understanding of the goal. That said, these are unfortunately band-aid requirements/solutions for cluster v1, IMO. I say we really need strong consistency for the topology (https://github.com/valkey-io/valkey/issues/58#issuecomment-2063091448).