[NEW] Redis consistent cluster SCAN

madolson commented 3 months ago

The problem/use-case that the feature addresses

Implement a scan cursor that can consistently scan the entire cluster, instead of today which requires individually targeting each node and sending it an individual SCAN command. This can also break, as slots can be migrated off the node and failovers can happen.

Description of the feature

Implement a consistent cluster scan with semantics like:

CSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type]

The cluster cursor would be marked as NOT_KEY, but would be hashed like all other keys by the clients so that they would be routed to the next node. The cursor would contain a component that includes a hashtag, to represent the slot it's currently scanning.

the format of the cursor would be:

<version>-{hashtag}-<slot db id>

Alternatives you've considered

Extending the existing SCAN cursor to support scanning a specific slot, something like:

SCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

This would require the end user to specific a specific slot, and the scan command would parse just that specific slot.

Additional information

See https://github.com/redis/redis/issues/2702.

NikolaBorisov commented 3 months ago

I think also adding

SCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

is important because it lets you scan the cluster in parallel. CSCAN can not be parallelized. My use case for this was when I have a large cluster and I want to iterate over all the keys and change them somewhat, but I want to do this in parallel. The cluster could be so large that using CSCAN could take super long time. I think the right abstraction would be just to allow user to specify which slot they want to scan. It is very easy to build something that scans the whole cluster reliably if you have that.

madolson commented 3 months ago

Ok, so maybe something that would be make me "happyish":

CSCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

Which just scans the given slot if it's provided. We can still have the be marked as a key, so that your client will route it for you. If you're really smart, you could reverse engineer a cursor that hits the right node and we could make some way to define that, but I think it makes sense to just have it be 0 or empty string.

There is a problem with my proposal. On the other thread, https://github.com/placeholderkv/placeholderkv/issues/4, means that it's not safe to do cross cluster scans without restarting the slot on moved, since the hashes in each databases aren't stable.

nihohit commented 3 months ago

does CSCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type] mean that the cursor is only good for iterations with the same slot? what would be the cursor semantics be?

madolson commented 3 months ago

If SLOT is provided, it would only be valid for the SLOT specified. If it's omitted, it would do a scan across all slots in the cluster.

nihohit commented 3 months ago

If SLOT is provided, it would only be valid for the SLOT specified. If it's omitted, it would do a scan across all slots in the cluster.

I wonder whether a cursor might become "accidentally" usable between regular SCAN, CSAN, and CSCAN+slot calls, simply because its computed in the same way for each call.

The cursor would contain a component that includes a hashtag, to represent the slot it's currently scanning.

So, this means that a CSCAN might return a MOVED error if there was slot migration? If so, I think that it solves the issue well, but this requires a lot of heavy lifting from the cursor. For example, assuming CSCAN goes by order of slots, if a node contains slot 1 & 3 but not 2, CSCAN without slots will need to return the keys from slot 1, even if they're below COUNT, and then answer the next command with a MOVED to the node with slot 2, which will in turn respond with the keys in slot 2 and a MOVED back to the first node. This allows the user to scan across the cluster, but it's not a great experience.

IMO this can be combined with a command that quickly (unlike CLUSTER NODES/SLOTS/SHARDS, which can be very slow in large, fragmented clusters) returns the slots available on the current node. Let's call it CLUSTER SLOTSLOCAL, and it only returns the slots available on the current node - no data on other nodes in the cluster. That way the can pipeline CSCAN calls with CLUSTER SLOTSLOCAL without a significant perf penalty, and quickly know whether there was slot migration. Once the CSCAN calls on this node complete, the user knows exactly which slots were covered (if there wasn't any change), or retry with CSCAN SLOT slot_id for slots that were added during the process.

madolson commented 3 months ago

I wonder whether a cursor might become "accidentally" usable between regular SCAN, CSAN, and CSCAN+slot calls, simply because its computed in the same way for each call.

That's possible, and we could do that. The only concern is that with SCAN, the client doesn't expect it to need routing. We introduced the concept of "NOT A KEY" in Redis 7, that still requires routing.

This allows the user to scan across the cluster, but it's not a great experience.

Not quite. Let's simplify, there are 2 nodes (A and B) with 3 slots (A and 1 and 3, B has 2). All slots have 100 keys. The command ordering would look like:

B> CSCAN 0 COUNT 50 -> {slot:0}-0 and []
A> SCAN {slot:0}-0 COUNT 50-> {slot:0}-50 and [50 keys]
A> SCAN {slot:0}-50 COUNT 50-> {slot:1}-0 and [50 keys] // Notice how the slot was updated, we returned the remaining keys and the slot is empty.
B> SCAN {slot:1}-0 COUNT 50-> {slot:1}-50 and [50 keys]
B> SCAN {slot:1}-50 COUNT 50-> {slot:2}-0}and [50 keys]
A> SCAN {slot:2}-0 COUNT 50-> {slot:2}-50 and [50 keys]
A> SCAN {slot:2}-50 COUNT 50-> 0 and [50 keys] // We got a zero back, we're done!

At no point are we getting a moved, since we're routing based on the slot information, and the client knows that. You're right that if there are few overall keys, we might not have a very high density. We could optimize that by also including data from the next slot if the node has it though.

That way the can pipeline CSCAN calls with CLUSTER SLOTSLOCAL without a significant perf penalty, and quickly know whether there was slot migration. Once the CSCAN calls on this node complete, the user knows exactly which slots were covered (if there wasn't any change), or retry with CSCAN SLOT slot_id for slots that were added during the process.

The ask has just been to parallelize it, which you could still do. If you have like a million keys, we're over indexing on performance, since it'll finish fast. If you have 10 billion keys (~1 million keys per slot), then parallelization makes sense.

nihohit commented 3 months ago

Let's simplify, there are 2 nodes (A and B) with 3 slots (A and 1 and 3, B has 2)

This scenario works for a cluster that is stable, but what happens during slot migrations, or scale in/out? For simplicity's sake, what happens if the client isn't aware of slot 2 moving from A to B? what would happen on a call A> SCAN {slot:1}-0 COUNT 50? It seems like the correct response is either

{slot:2}-50 and [50 keys], since these are the next values on node A, or
MOVED B, since the next slot for the scan is in B

what would happen on a A> SCAN {slot:1}-0 COUNT 200 - where the count is larger than the number of entries in the slot? should A return {slot:3}-0 and [200 keys from slots 0 & 2, and implicitly skip slot 1? should it return {slot:1}-0 and [100 keys from slot 0], and under-provide on the COUNT in order to correctly reflect the missing slot?

Notice that in these examples the calls are without the [SLOT 1] argument - there's nothing explicitly requiring the queried node to contain slot 2.

Let's take a scenario in which the client only calls CSCAN, and doesn't perform any other operations - how would such a client correctly scan through a cluster undegoing changes?

madolson commented 3 months ago

if the client isn't aware of slot 2 moving from A to B

Then it'll get a MOVED message and try again. This is the normal behavior for misunderstanding the topology.

what would happen on a A> SCAN {slot:1}-0 COUNT 200 - where the count is larger than the number of entries in the slot? should A return {slot:3}-0 and [200 keys from slots 0 & 2, and implicitly skip slot 1

The current implementation only returns data from one slot at a time, which is our atomic unit, we would need to restart at the next slot. I mentioned an optimization, but I think you would probably want to opt in to it.

nihohit commented 3 months ago

The current implementation only returns data from one slot at a time

Oh, excellent. I didn't notice that this in the documentation. This solves my issues :)

avifenesh commented 3 months ago

Hope Im not repeating something, couldn't find some mention of it. What about clusters with small amount of keys, lets say 200 heavy keys distributed to different slots, would i need to run 16384 scans to get all the keys in my cluster? The default count is 10 in the classic implementation so it would be around 20 calls, here its seems to be almost x1000 more calls. The classic implementation is based on the node dictht which its size and the inside key distribution is kind of equivalent to the amount of keys (worse case amount of keys X 2), which promise some efficiency in this kind of scenarios. With the new impl' offered, if the amount of keys i have is smaller than the 163840 i still need to add my own impl' for scanning efficiently.

ranshid commented 3 months ago

What about clusters with small amount of keys, lets say 200 heavy keys distributed to different slots, would i need to run 16384 scans to get all the keys in my cluster? The default count is 10 in the classic implementation so it would be around 20 calls, here its seems to be almost x1000 more calls.

I think this is a valid point, which will be SOMEWHAT handled in case we will continue to consume continuance slot ranges (which I support). This would leave the issue as more impactful for fragmented slot ranges, but even in case of fragmented slot ranges we expect to have some continuation. and there is always the ability of the client to fanout on all slots in case the application understand it's workload.

I would like to ask 2 other questions:

How do we plan to make these commands available via scripts/transactions? I would imagine CSCAN will probably be non-script, but SCAN with specific slot will be available right?
one thing that I think is missing from current scan, is filter by TTL. can we consider adding such an option?

CharlesChen888 commented 2 months ago

Note that [pattern] may imply a specific slot, and this is useful when slot is not provided, or when the provided slot is different from the slot pattern implies.

madolson commented 2 months ago

How do we plan to make these commands available via scripts/transactions? I would imagine CSCAN will probably be non-script, but SCAN with specific slot will be available right?

Why would CSCAN be non-script? I don't see anything that would strictly break from it.

one thing that I think is missing from current scan, is filter by TTL. can we consider adding such an option?

Sounds like a separate issue? What is the use case to filter by TTL?

valkey-io / valkey

[NEW] Redis consistent cluster SCAN #33