valkey-io / valkey

A flexible distributed key-value datastore that is optimized for caching and other realtime workloads.
https://valkey.io
Other
17.61k stars 663 forks source link

[NEW] Cluster-wide SCAN #33

Open madolson opened 8 months ago

madolson commented 8 months ago

The problem/use-case that the feature addresses

Implement a scan cursor that can consistently scan the entire cluster, instead of today which requires individually targeting each node and sending it an individual SCAN command. This can also break, as slots can be migrated off the node and failovers can happen.

Description of the feature

Implement a consistent cluster scan with semantics like:

CSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type]

The cluster cursor would be marked as NOT_KEY, but would be hashed like all other keys by the clients so that they would be routed to the next node. The cursor would contain a component that includes a hashtag, to represent the slot it's currently scanning.

the format of the cursor would be:

<version>-{hashtag}-<slot db id>

Alternatives you've considered

Extending the existing SCAN cursor to support scanning a specific slot, something like:

SCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

This would require the end user to specific a specific slot, and the scan command would parse just that specific slot.

Additional information

See https://github.com/redis/redis/issues/2702.

NikolaBorisov commented 8 months ago

I think also adding

SCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

is important because it lets you scan the cluster in parallel. CSCAN can not be parallelized. My use case for this was when I have a large cluster and I want to iterate over all the keys and change them somewhat, but I want to do this in parallel. The cluster could be so large that using CSCAN could take super long time. I think the right abstraction would be just to allow user to specify which slot they want to scan. It is very easy to build something that scans the whole cluster reliably if you have that.

madolson commented 8 months ago

Ok, so maybe something that would be make me "happyish":

CSCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

Which just scans the given slot if it's provided. We can still have the be marked as a key, so that your client will route it for you. If you're really smart, you could reverse engineer a cursor that hits the right node and we could make some way to define that, but I think it makes sense to just have it be 0 or empty string.

There is a problem with my proposal. On the other thread, https://github.com/placeholderkv/placeholderkv/issues/4, means that it's not safe to do cross cluster scans without restarting the slot on moved, since the hashes in each databases aren't stable.

nihohit commented 8 months ago

does CSCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type] mean that the cursor is only good for iterations with the same slot? what would be the cursor semantics be?

madolson commented 8 months ago

If SLOT is provided, it would only be valid for the SLOT specified. If it's omitted, it would do a scan across all slots in the cluster.

nihohit commented 8 months ago

If SLOT is provided, it would only be valid for the SLOT specified. If it's omitted, it would do a scan across all slots in the cluster.

I wonder whether a cursor might become "accidentally" usable between regular SCAN, CSAN, and CSCAN+slot calls, simply because its computed in the same way for each call.

The cursor would contain a component that includes a hashtag, to represent the slot it's currently scanning.

So, this means that a CSCAN might return a MOVED error if there was slot migration? If so, I think that it solves the issue well, but this requires a lot of heavy lifting from the cursor. For example, assuming CSCAN goes by order of slots, if a node contains slot 1 & 3 but not 2, CSCAN without slots will need to return the keys from slot 1, even if they're below COUNT, and then answer the next command with a MOVED to the node with slot 2, which will in turn respond with the keys in slot 2 and a MOVED back to the first node. This allows the user to scan across the cluster, but it's not a great experience.

IMO this can be combined with a command that quickly (unlike CLUSTER NODES/SLOTS/SHARDS, which can be very slow in large, fragmented clusters) returns the slots available on the current node. Let's call it CLUSTER SLOTSLOCAL, and it only returns the slots available on the current node - no data on other nodes in the cluster. That way the can pipeline CSCAN calls with CLUSTER SLOTSLOCAL without a significant perf penalty, and quickly know whether there was slot migration. Once the CSCAN calls on this node complete, the user knows exactly which slots were covered (if there wasn't any change), or retry with CSCAN SLOT slot_id for slots that were added during the process.

madolson commented 8 months ago

I wonder whether a cursor might become "accidentally" usable between regular SCAN, CSAN, and CSCAN+slot calls, simply because its computed in the same way for each call.

That's possible, and we could do that. The only concern is that with SCAN, the client doesn't expect it to need routing. We introduced the concept of "NOT A KEY" in Redis 7, that still requires routing.

This allows the user to scan across the cluster, but it's not a great experience.

Not quite. Let's simplify, there are 2 nodes (A and B) with 3 slots (A and 1 and 3, B has 2). All slots have 100 keys. The command ordering would look like:

B> CSCAN 0 COUNT 50 -> {slot:0}-0 and []
A> SCAN {slot:0}-0 COUNT 50-> {slot:0}-50 and [50 keys]
A> SCAN {slot:0}-50 COUNT 50-> {slot:1}-0 and [50 keys] // Notice how the slot was updated, we returned the remaining keys and the slot is empty.
B> SCAN {slot:1}-0 COUNT 50-> {slot:1}-50 and [50 keys]
B> SCAN {slot:1}-50 COUNT 50-> {slot:2}-0}and [50 keys]
A> SCAN {slot:2}-0 COUNT 50-> {slot:2}-50 and [50 keys]
A> SCAN {slot:2}-50 COUNT 50-> 0 and [50 keys] // We got a zero back, we're done!

At no point are we getting a moved, since we're routing based on the slot information, and the client knows that. You're right that if there are few overall keys, we might not have a very high density. We could optimize that by also including data from the next slot if the node has it though.

That way the can pipeline CSCAN calls with CLUSTER SLOTSLOCAL without a significant perf penalty, and quickly know whether there was slot migration. Once the CSCAN calls on this node complete, the user knows exactly which slots were covered (if there wasn't any change), or retry with CSCAN SLOT slot_id for slots that were added during the process.

The ask has just been to parallelize it, which you could still do. If you have like a million keys, we're over indexing on performance, since it'll finish fast. If you have 10 billion keys (~1 million keys per slot), then parallelization makes sense.

nihohit commented 8 months ago

Let's simplify, there are 2 nodes (A and B) with 3 slots (A and 1 and 3, B has 2)

This scenario works for a cluster that is stable, but what happens during slot migrations, or scale in/out? For simplicity's sake, what happens if the client isn't aware of slot 2 moving from A to B? what would happen on a call A> SCAN {slot:1}-0 COUNT 50? It seems like the correct response is either

what would happen on a A> SCAN {slot:1}-0 COUNT 200 - where the count is larger than the number of entries in the slot? should A return {slot:3}-0 and [200 keys from slots 0 & 2, and implicitly skip slot 1? should it return {slot:1}-0 and [100 keys from slot 0], and under-provide on the COUNT in order to correctly reflect the missing slot?

Notice that in these examples the calls are without the [SLOT 1] argument - there's nothing explicitly requiring the queried node to contain slot 2.

Let's take a scenario in which the client only calls CSCAN, and doesn't perform any other operations - how would such a client correctly scan through a cluster undegoing changes?

madolson commented 8 months ago

if the client isn't aware of slot 2 moving from A to B

Then it'll get a MOVED message and try again. This is the normal behavior for misunderstanding the topology.

what would happen on a A> SCAN {slot:1}-0 COUNT 200 - where the count is larger than the number of entries in the slot? should A return {slot:3}-0 and [200 keys from slots 0 & 2, and implicitly skip slot 1

The current implementation only returns data from one slot at a time, which is our atomic unit, we would need to restart at the next slot. I mentioned an optimization, but I think you would probably want to opt in to it.

nihohit commented 8 months ago

The current implementation only returns data from one slot at a time

Oh, excellent. I didn't notice that this in the documentation. This solves my issues :)

avifenesh commented 8 months ago

Hope Im not repeating something, couldn't find some mention of it. What about clusters with small amount of keys, lets say 200 heavy keys distributed to different slots, would i need to run 16384 scans to get all the keys in my cluster? The default count is 10 in the classic implementation so it would be around 20 calls, here its seems to be almost x1000 more calls. The classic implementation is based on the node dictht which its size and the inside key distribution is kind of equivalent to the amount of keys (worse case amount of keys X 2), which promise some efficiency in this kind of scenarios. With the new impl' offered, if the amount of keys i have is smaller than the 163840 i still need to add my own impl' for scanning efficiently.

ranshid commented 8 months ago

What about clusters with small amount of keys, lets say 200 heavy keys distributed to different slots, would i need to run 16384 scans to get all the keys in my cluster? The default count is 10 in the classic implementation so it would be around 20 calls, here its seems to be almost x1000 more calls.

I think this is a valid point, which will be SOMEWHAT handled in case we will continue to consume continuance slot ranges (which I support). This would leave the issue as more impactful for fragmented slot ranges, but even in case of fragmented slot ranges we expect to have some continuation. and there is always the ability of the client to fanout on all slots in case the application understand it's workload.

I would like to ask 2 other questions:

  1. How do we plan to make these commands available via scripts/transactions? I would imagine CSCAN will probably be non-script, but SCAN with specific slot will be available right?
  2. one thing that I think is missing from current scan, is filter by TTL. can we consider adding such an option?
CharlesChen888 commented 7 months ago

Note that [pattern] may imply a specific slot, and this is useful when slot is not provided, or when the provided slot is different from the slot pattern implies.

madolson commented 7 months ago

How do we plan to make these commands available via scripts/transactions? I would imagine CSCAN will probably be non-script, but SCAN with specific slot will be available right?

Why would CSCAN be non-script? I don't see anything that would strictly break from it.

one thing that I think is missing from current scan, is filter by TTL. can we consider adding such an option?

Sounds like a separate issue? What is the use case to filter by TTL?

zuiderkwast commented 1 month ago

I know @madolson you don't want the end users to be aware of slots, but we can say that the intention is that client libraries should handle the slots and provide Scan functionality to the end user without exposing the slots. Then it should be fine?

I believe scanning per slot is the most robust way that also allows nodes to be scanned in parallel.

I'm not sure about the C prefix in CSCAN though. I know Redis added a command called SFLUSH to flush all keys in a slot, but the S prefix is normally for sets, like SADD, so that's pretty bad imo. C isn't used as a prefix yet though, but I think CLUSTER SCAN is better. It puts the command within the cluster-specific namespace, so non-cluster users don't get confused by it.

madolson commented 1 month ago

I know @madolson you don't want the end users to be aware of slots, but we can say that the intention is that client libraries should handle the slots and provide Scan functionality to the end user without exposing the slots. Then it should be fine?

I'm OK with this yes. The point of the cluster abstraction is that end users just see a flat keyspace. I have a secondary goal of I don't want clients to have to do much work, but given that Glide was willing to fully track the entire slot space (See https://github.com/valkey-io/valkey-glide/pull/1623) to do a full cluster scan I guess that point isn't really that important anymore.

I believe scanning per slot is the most robust way that also allows nodes to be scanned in parallel.

I agree? I'm not sure if this is disagreeing with my point, but I was just positing that we can re-use a lot of existing functionality if we encode the slot into the cursor. I also thought it might be possible to someday in the future incorporate other types of hashing (like consistent hashing).

C isn't used as a prefix yet though, but I think CLUSTER SCAN is better.

I have a strong preference not to put something in the CLUSTER container unless it's actually necessary. There is a lot of annoyance around CLIENT having a mix of admin commands and steady state commands, so it makes ACLs hard to reason about. I think we can't use SSCAN because it already existed (with set scan). I assume SFLUSH was following after SPUBLISH, which stood for sharded publish. I don't like CSCAN, but it was the best I could come up with.

zuiderkwast commented 1 month ago

I agree? I'm not sure if this is disagreeing with my point, but I was just positing that we can re-use a lot of existing functionality if we encode the slot into the cursor.

I didn't mean to say you disagree about this one. :smile:

Btw, we already encode the slot in the cursor, though it's undocumented.

I also thought it might be possible to someday in the future incorporate other types of hashing (like consistent hashing).

How are you thinking? The slot mapping is already a superset of consistent hashing. Or... replace CRC16 with ... SHA256? I think it would be a very big breaking change to cluster clients if we change the slot concept.

CRC16 is not fair, but we could already compensate for this. We can create a pre-computed heat map that say how likely a random key is to hash to each slot, that we can use when scaling to compensate for the unfairness of CRC16.

I have a strong preference not to put something in the CLUSTER container

Fair point about not mixing with admin commands. (CLUSTER SLOTS is not an admin command though.)

How about the already suggested names SLOTSCAN, or SCAN with a SLOT argument? They are clear and intuitive to me.

madolson commented 1 month ago

Btw, we already encode the slot in the cursor, though it's undocumented.

Sure, but it doesn't work with any existing cluster routing. It's also undocumented because it's fragile and might change.

How are you thinking? The slot mapping is already a superset of consistent hashing. Or... replace CRC16 with ... SHA256? I think it would be a very big breaking change to cluster clients if we change the slot concept.

Let's say we move away from 2^14 slots. Are we okay with adding a net new command, or a new argument, to add support for that new backend? The idea that I proposed, CSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type], would work regardless of the underlying number or type of slots. Which is why it felt more like a user command to me, since it represents the abstraction we want to show end users, the flat keyspace.

SLOTSCAN

Slight preference for this, since I would like it to be it's own command with it's own documentation page. We could also just do CLUSTERSCAN I guess. It doesn't have the ACL baggage.

madolson commented 1 month ago

In addition, I'm also pretty chill with this:

CLUSTERSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]

The default scans all slots. If you specify a specific slot, it scans just that slot.

zuiderkwast commented 1 month ago

Let's say we move away from 2^14 slots.

It's such a breaking change that we'll probably never do it, and if we do, we could still use the slot concept, though we may want to accept slot ranges everywhere rather than single slots, if the total number of slots is very large.

CLUSTERSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]

Yes! A accept this one. This works for the two use cases:

  1. without slots, scan the node until you get a redirect to the next shard
  2. with a slot, makes it possible to scan multiple nodes in parallel without worrying about slot migrations

The version without a slot is a bit risky if a slot migration happens during the scan, well, at least if you try to scan multiple shards in parallel. I don't want the client to feel the need do a 16K calls to CLUSTER SLOTS to detect that (is it what GLIDE does?). With a specified slot, you would get a redirect if the slot has been migrated. (We could even consider a slot range and return some error if the node doesn't own all slots in the range. This can be a future parameter.)

madolson commented 1 month ago

I don't want the client to feel the need do a 16K calls to CLUSTER SLOTS to detect that (is it what GLIDE does?).

GLIDE does a normal SCAN, if during the scan it detects the slot configuration changed, it will redo the scan. This maintains the current guarantee that items can be returned multiple times. Once it's done a full scan, it will indicate which slots it has scanned. I'm not sure it returns a serializable cursor though, @avifenesh would know though.

(We could even consider a slot range and return some error if the node doesn't own all slots in the range. This can be a future parameter.)

True.

avifenesh commented 1 month ago

I don't want the client to feel the need do a 16K calls to CLUSTER SLOTS to detect that (is it what GLIDE does?).

GLIDE does a normal SCAN, if during the scan it detects the slot configuration changed, it will redo the scan. This maintains the current guarantee that items can be returned multiple times. Once it's done a full scan, it will indicate which slots it has scanned. I'm not sure it returns a serializable cursor though, @avifenesh would know though.

(We could even consider a slot range and return some error if the node doesn't own all slots in the range. This can be a future parameter.)

True.

Glide scan nodes normally and sign the slots they own as scanned when a node is completed. If some slots are added (epoch bump) the node is not valid and we don't mark its slots as scanned. If some slots moved out (no epoch change) we mark the remaining slots as scanned. We keep going to the owner of the next slots that haven't scanned.

Glide call cluster slots to a small number of nodes after each node completion. So basically calls to slots as the number of shards.

We return to the user RC pointing to the object with the Scan progress (slot scanned, node in scan, epoch at the beginning of the node Scan).

avifenesh commented 3 weeks ago

In addition, I'm also pretty chill with this:

CLUSTERSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]

The default scans all slots. If you specify a specific slot, it scans just that slot.

I'm afraid about the cost of multiple requests for one iteration, if the slots are not in the same shard, the call would need to be directed to many, instead of one by one. In the worst case, it can be pretty expensive. Maybe also in the common case. Clients can solve that by using a topology map and choosing slots in the same shard but it, but is it the intention to still need the client to implement some method for smart scan?

madolson commented 3 weeks ago

I'm afraid about the cost of multiple requests for one iteration, if the slots are not in the same shard, the call would need to be directed to many, instead of one by one. In the worst case, it can be pretty expensive. Maybe also in the common case.

This doesn't seem like that material of an issue to me. The point is to make the scan more scalable, so we are more worried about the large cluster case. I didn't really follow the last sentence, what does "Maybe also in the common case" mean?

Clients can solve that by using a topology map and choosing slots in the same shard but it, but is it the intention to still need the client to implement some method for smart scan?

The goal is to balance the complexity on the server and client. The current implementation in glide, imo, is overly complicated to try to replicate the guarantees of the server. It seems like with a rather small change to the server-side cursor the client work becomes much simpler.