redis / redis

Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs, Bitmaps.
http://redis.io

"CLUSTER MIGRATE slot node" command #2807

Open drabaioli opened 9 years ago

drabaioli commented 9 years ago

Feature request: at the moment, migrating a slot from one node to another requires either the redis-trib.rb script or multiple manual operations on different nodes (see the sketch at the end of this comment).

However this set of operations could be wrapped in a "native" command:

CLUSTER MIGRATE slot target-node

As far as I can see, once this command is integrated, the CLUSTER SETSLOT slot [NODE|MIGRATING|IMPORTING] command would not be needed anymore.
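For reference, here is a rough sketch of the manual sequence that such a command would wrap, following the procedure described in the Redis Cluster specification (ports and node IDs below are placeholders):

```
# On the target node: mark slot 1234 as being imported from the source
redis-cli -p 7001 CLUSTER SETSLOT 1234 IMPORTING <source-node-id>
# On the source node: mark slot 1234 as migrating to the target
redis-cli -p 7000 CLUSTER SETSLOT 1234 MIGRATING <target-node-id>
# On the source node: list keys in the slot and migrate them, repeating until empty
redis-cli -p 7000 CLUSTER GETKEYSINSLOT 1234 100
redis-cli -p 7000 MIGRATE 127.0.0.1 7001 key1 0 5000
# Finally, assign the slot to the target (ideally on all masters)
redis-cli -p 7000 CLUSTER SETSLOT 1234 NODE <target-node-id>
redis-cli -p 7001 CLUSTER SETSLOT 1234 NODE <target-node-id>
```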

shaharmor commented 9 years ago

This will surely make things easier. +1. Maybe even make it possible to migrate multiple slots at the same time.

drabaioli commented 8 years ago

I've considered migrating multiple slots at once but I think, due to the signature of the SETSLOT commands to be performed, moving one slot at a time would be more correct/easier to implement.

giladwa commented 3 years ago

Is there an easy way in Redis 6 to migrate specific slots?

madolson commented 3 years ago

Going to add this to the Redis 7.0 sprint. The support doesn't exist natively today. We could implement something like BGSAVE though, where we fork and write out all the data to the target slot.

giladwa commented 3 years ago

@madolson is that what redis-cli --cluster reshard does but only with the ability to choose the specific slots?

madolson commented 3 years ago

@giladwa Not quite, it calls KEYSINSLOT and then calls the MIGRATE command on each key consecutively.
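For context, the non-interactive form of that reshard flow looks roughly like the sketch below; note that --cluster-slots takes a count of slots to move rather than specific slot numbers, which is why picking exact slots isn't straightforward (host, port, and node IDs are placeholders):

```
# Ask redis-cli to move 100 slots between two masters; under the hood it runs
# CLUSTER GETKEYSINSLOT on the source and MIGRATE for the keys of each slot.
redis-cli --cluster reshard 127.0.0.1:7000 \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots 100 \
  --cluster-yes
```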

chendq8 commented 2 years ago

Going to add this to the Redis 7.0 sprint. The support doesn't exist natively today. We could implement something like BGSAVE though, where we fork and write out all the data to the target slot.

@madolson Can I get an update on this?

madolson commented 2 years ago

@chendq8 Nothing much new here. We are planning a big revamp of cluster in Redis 8, hopefully it will be picked up then. (Redis 7 is just around the corner)

chendq8 commented 2 years ago

@chendq8 Nothing much new here. We are planning a big revamp of cluster in Redis 8, hopefully it will be picked up then. (Redis 7 is just around the corner)

@madolson Thank you for your reply. Is there any specific modification plan? Will compatibility with older protocols be considered?

madolson commented 2 years ago

@chendq8 What do you mean compatibility with older protocols?

chendq8 commented 2 years ago

I mean, will the “big revamp” change the cluster message bus?

madolson commented 2 years ago

@chendq8 Yeah, there will likely be some changes to the cluster bus. We will make sure that updating to the new version is seamless and, from the client perspective, fully compatible with existing versions of Redis.

zuiderkwast commented 2 years ago

We're interested in this. Migrating some slots in a highly loaded system can take us an hour, due to the large reply of CLUSTER GETKEYSINSLOT and then the many MIGRATE calls that move only a few keys at a time so as not to block the nodes for too long.

The source node can simply behave as if MIGRATE was called multiple times and communicate with the destination in the same old way. A simple way to implement it would be to do some work in the cron function, but I think forking might be a better option.

yossigo commented 2 years ago

There's already a lot of room for improvement around MIGRATE, but ultimately I think we'd want serialization to take place off the main thread.

Forking a child is definitely one option. On the other hand, Redis Cluster already handles individual key migrations and doesn't rely on a point-in-time snapshot - so this might be an opportunity to re-evaluate ideas around key locking and thread offloading of slow operations.

zuiderkwast commented 2 years ago

Summarizing what's been discussed about CLUSTER MIGRATE in other places.

Benefits, if we do this right:

This is a draft of how the procedure could work. Since it's a complex feature, I'd prefer that we agree on the ideas at a high level before starting the implementation.

  1. CLUSTER MIGRATE is called by a client. It blocks the caller until the migration is finished or aborted.
  2. A fork is created that migrates the keys in the slot to the target node. In the meantime, writes to the slot are accumulated in the main process.
  3. When the forked process is done, the accumulated writes are transferred to the target node, like a replication stream.
  4. When the target has caught up in the replication, the source node pauses writes to the slot and tells the target node to request cluster consensus for the slot (new command?).
  5. The target node sends a new cluster bus message CLUSTERMSG_TYPE_SLOT_AUTH_REQUEST to all primaries.
  6. The primaries reply with a new message CLUSTERMSG_TYPE_SLOT_AUTH_ACK if the old and new slot owners make sense (the nodes are known masters, and the epoch and slot ownership are consistent).
  7. If a majority votes for the new slot owner, the new slot owner broadcasts its new slot mapping and epoch.
  8. When the source node realizes that it's no longer the slot owner, the paused writes to the slot are unpaused and those clients receive a -MOVED redirect. The source node purges the slot.

In case of failure or timeout, the migration is aborted. The source node still contains all the keys and no data is lost. The target node purges any imported keys of the slot. (E.g. the source node sends a new command CLUSTER PURGE-SLOT to the target.)

A new cluster node flag bit might be needed to indicate that a node supports these new cluster bus messages. This can prevent the migration from even being attempted if most nodes in the cluster don't support this new migration.
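To make the operator-facing side of this concrete, a hypothetical session could look like the sketch below; CLUSTER MIGRATE and CLUSTER PURGE-SLOT are the commands proposed in this comment, not existing Redis commands:

```
# Operator, against the source node: blocks until slot 1234 is owned by the target
redis-cli -p 7000 CLUSTER MIGRATE 1234 <target-node-id>

# On failure or timeout the source aborts, keeps all of its keys, and internally
# issues something like the following against the target to drop partial imports:
#   CLUSTER PURGE-SLOT 1234
```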

madolson commented 2 years ago

I'll also add that we have someone assigned internally within AWS to open source what we have, which implements generally what you outlined with some caveats.

CLUSTER MIGRATE is called by a client. It blocks the caller until the migration is finished or aborted.

I don't think this is necessary, we can just return okay and indicate the migration is in progress.

A fork is created that migrates the keys in the slot to the target node. In the meantime, writes to the slot are accumulated in the main process.

Although unspoken, the assumption here is that we are generating RESTORE commands with the data and sending them along. This is like replication, but since the target needs to continue serving writes we can't put it into the blocking replication state.

When the target has caught up in the replication, the source node pauses writes to the slot and tells the target node to request cluster consensus for the slot (new command?).

Our internal implementation does this: we send a "slot migrate complete" command, which does basically this. It acks if it still believes it's a master.

If a majority votes for the new slot owner, the new slot owner broadcasts its new slot mapping and epoch.

I don't think we need an election here though, and I generally want to advocate against cluster-wide decisions that can be handled locally. The target is the owner of its slots, so if it accepts the slot and replicates that, all is good. Note that failure modes without replicas/persistence don't really matter, since your data can vanish anyway. There are basically three failure modes here:

  1. The source dies after sending the complete command, but before replicating it to its replicas. The replica will then be promoted, believe it owns the slots, but since the epoch was bumped on the target, it'll realize it no longer owns the slots and purge them.
  2. The target dies after receiving the slot complete command, and before replicating/persisting it, so the replica thinks it's still in a migration state. After promotion it will realize it has data for slots it doesn't own, then purge them.
  3. The case where both die during the handoff is the same as case two, but requires special handling since we need to pick a replica to "take ownership" of the orphaned slot.

The only open question I want to dive into more deeply is what @yossigo mentioned about fork vs blocking. We have a forkless blocking approach internally, as well as a fork-based one, and they have different trade-offs. The forkless approach involves incrementally iterating through all of the keys, locking them, and handing them to background threads to serialize and send. This starts to have issues with larger keys, as keys that are blocked will also block access on the main Redis thread. It also requires a lot of intelligent parsing to understand "what keys will a command access" in order to know what to block. I think we have the expertise to help implement that, but there are a lot of nuanced edge cases there.

The fork-based approach, on the flip side, is much simpler to implement, but you pay the CoW cost as well as the initial fork time. From an ROI perspective, I would advocate we start with the fork-based approach for now and get the fundamentals in place, and we can incrementally build a forkless implementation after more discussion.

zuiderkwast commented 2 years ago

I'll also add that we have someone assigned internally within AWS to open source what we have, which implements generally what you outlined with some caveats.

@madolson That's good news!

There are basically three failure modes here:

  1. (...)
  2. The target dies after receiving the slot complete command, and before replicating/persisting it, so the replica thinks it's still in a migration state. After promotion it will realize it has data for slots it doesn't own, then purge them.

If the target node takes ownership of a slot and then fails over before its replicas know that, then we may end up losing the whole slot if we're not careful. We should take care to avoid this. For example, the target can make sure to replicate the "slot complete" command to its replicas before bumping the epoch and distributing the pong to the rest of the cluster. The source node shouldn't give up the slot ownership until the target has done this.

We have a forkless blocking approach internally, as well as a fork-based one, and they have different trade-offs. The forkless approach involves incrementally iterating through all of the keys, locking them, and handing them to background threads to serialize and send. This starts to have issues with larger keys, as keys that are blocked will also block access on the main Redis thread.

Good idea. It locks one key, serializes it and then unlocks it, right? If it marks the key with some flag, or adds it to some temporary dict to keep track, we can know for which keys we need to accumulate writes.

Locking a single key to serialize it doesn't seem like a problem to me. Many commands do that, e.g. DEL or COPY or MIGRATE for that matter.

It also requires a lot of intelligent parsing to understand "what keys will a command access" in order to know what to block. I think we have the expertise to help implement that, but there are a lot of nuanced edge cases there.

Oh, but can't we just serialize it in the main thread and then unlock the key? No need to block any command. Only some per-key flag to know if we need to accumulate writes.

The fork-based approach, on the flip side, is much simpler to implement, but you pay the CoW cost as well as the initial fork time.

Good point. This would CoW not only the slot we're interested in but all slots and all memory.

madolson commented 2 years ago

Oh, but can't we just serialize it in the main thread and then unlock the key? No need to block any command. Only some per-key flag to know if we need to accumulate writes.

We could; the downside here is that we're now blocking the main thread on key serialization (which ironically we do anyway as part of slot migration; I'm surprised more people don't complain about this). It's worth noting that we also block the key loading on the target side.

If the target node takes ownership of a slot and then fails over before its replicas know that, then we may end up losing the whole slot if we're not careful. We should take care to avoid this. For example, the target can make sure to replicate the "slot complete" command to its replicas before bumping the epoch and distributing the pong to the rest of the cluster. The source node shouldn't give up the slot ownership until the target has done this.

That's a good point, I haven't looked at our internal implementation recently but we do something like this. Replicating the commands for slot migration helps a lot in making sure primary/replica stay in sync.

PingXie commented 2 years ago

I extended my changes for #10517 and created a prototype to implement this new CLUSTER MIGRATE command. The solution requires neither forking the process nor extra locking. It is done in a way similar to the full sync flow and runs in the main event loop. The core idea is to move the entire slot migration protocol into redis-server and wrap it under the single command CLUSTER MIGRATE. At a high level, the solution works as follows:

  1. Admin executes CLUSTER MIGRATE SLOT <n> NODE <N>
  2. After validating all the parameters, a slotMigrationRecord struct is created and queued up
  3. clusterCron periodically checks all queued migration requests
  4. For each not-yet-started request, create a connection and set up a connection handler
  5. On connection-establishment, set up a read handler for the connection
  6. On read-ready, send AUTH if present and return control back to the event loop
  7. On read-ready and AUTH success, send SETSLOT IMPORTING to target and return control back to the event loop
  8. On read-ready and SETSLOT IMPORTING success, send SETSLOT MIGRATING to myself via a fake client and manually trigger the replication (see issue #10517)
  9. Send SELECT 0 to target and return control back to the event loop
  10. On read-ready and SELECT 0 success, set up a write handler for the connection
  11. On write-ready, start preparing RESTORE commands. There are a few things to consider:
      a. We should batch enough key/value pairs so that we don't invoke the send syscall too often. A value of 64KB could be a good starting point but is also configurable.
      b. We shouldn't accumulate too much data either, because this step is performed as part of the main event loop, so the more data processed the more impact on the customer workload. A value of 128KB could be a starting point but is also configurable.
      c. If a key/value pair is too big (say bigger than 200MB), we have a choice to stop the migration so as not to risk OOM'ing Redis. Note that with the changes introduced by #10517, the slot in question will remain in a stable migrating state even after primary failures in either the source or the target.
      d. The RESTORE commands are then sent synchronously to the target.
      e. On RESTORE success, delete the migrated keys locally using the same fake client and manually trigger the replication.
      f. If there is no key left in the slot, go to step 12.
      g. Otherwise, return control to the event loop and wait for the connection to be ready for write again.
  12. Set up a read handler for the connection, send SETSLOT NODE to target and return control back to the event loop
  13. On read-ready and SETSLOT NODE success, send SETSLOT NODE to myself via the same fake client and manually trigger the replication
  14. Complete the active slot migration record and log all performance metrics.
  15. The slot migration record will be kept for 10 mins after completion and can be examined by a new cluster command CLUSTER REPORTMIGRATION.
  16. Any errors encountered will cause a connection reset followed by a retry. After the maximum number of retries has been attempted, the slot migration request will be completed with an error message.
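To illustrate the happy path of steps 4-13, the command stream from the source to the target would look roughly like the sketch below (slot number, node IDs, and payloads are placeholders; batching follows the thresholds in step 11):

```
AUTH <password>                                  # step 6, only if auth is configured
CLUSTER SETSLOT 1234 IMPORTING <source-node-id>  # step 7, target starts importing
# step 8 runs locally on the source: CLUSTER SETSLOT 1234 MIGRATING <target-node-id>
SELECT 0                                         # step 9
RESTORE key:a 0 "<serialized-value>"             # step 11, batched up to the
RESTORE key:b 0 "<serialized-value>"             # configured size; migrated keys are
                                                 # deleted locally, repeat until empty
CLUSTER SETSLOT 1234 NODE <target-node-id>       # step 12, target takes ownership
```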

I am curious to hear the thoughts and feedback from the community. I am planning to share the code after #10517 is merged to reduce the amount of diff but I don't mind preparing a larger PR if it helps.

zuiderkwast commented 2 years ago

@PingXie Your solution is probably simpler to implement than "atomic slot migration" but it only solves some of the problems discussed here. On failover during migration, the same inconsistencies can happen as when the old slot migration commands are executed by an external tool. A repair using redis-cli --cluster fix or equivalent is still needed in this case.

I'm in favour of this solution, if we make sure it is possible to extend it in the future to also solve these issues:

  * No manual repair should be needed.
  * Make inconsistencies impossible by keeping the migrated keys and accumulating updates until the whole slot is migrated and atomically transfer the ownership to the new owner.
  * In case of failover of either the importing or the migrating node, the migration is rolled back.

Send SELECT 0 to target and return control back to the event loop

Why?

start preparing RESTORE commands. There are a few things to consider: a. We should batch enough key/value pairs so that we don't invoke the send syscall too often. A value of 64KB could be a good starting point but is also configurable. b. We shouldn't accumulate too much data either, because this step is performed as part of the main event loop, so the more data processed the more impact on the customer workload. A value of 128KB could be a starting point but is also configurable.

This is currently controlled by how many keys are migrated simultaneously using the MIGRATE command.

It would be nice to avoid adding configuration for this. It's better if it just works, especially if there is an idea of improving this later to make the config obsolete.

PingXie commented 2 years ago

On failover during migration, the same inconsistencies can happen as when the old slot migration commands are executed by an external tool. A repair using redis-cli --cluster fix or equivalent is still needed in this case.

I am confident that the (in)consistencies are already solved by #10517 (I have more updates coming next, by the way), which is what this solution is based on.

The utopian world that I would like to get to (and we are getting there now) is that Redis will automatically resume any slot migration interrupted by failovers (manual or automatic). The only exception would be large key/value pairs, because they risk OOM'ing the server. There will, though, be a config that defines what "large" means exactly. The path I am taking is (in order of dependency):

  1. Add the shard-id support #10474
  2. Improve the current slot migration reliability #10517
  3. Add support for failover in empty shards. I have done this as part of the reliability improvement but I guess I will create a separate issue to track the work. It is not currently part of the PR for #10517 though.
  4. Implement basic single-shot cluster migration command (this issue)
  5. Automatically resume any slot migration interrupted by failover (working on it but haven't decided whether it should go on a different issue or not)

After completing the list, an admin can just fire-n-forget the slot migration request and be assured that Redis will complete the job even if there is any failure (again barring the large data guardrail).

No manual repair should be needed.

Yes. This is the plan.

Make inconsistencies impossible by keeping the migrated keys and accumulating updates until the whole slot is migrated and atomically transfer the ownership to the new owner.

Yes.

In case of failover of either the importing or the migrating node, the migration is rolled back.

Large data is the only case where I can see rollback helping, but I still think it is best left to the admin. For interruptions caused by failover, auto roll-forward makes more sense IMO. Did I miss some use cases?

Send SELECT 0 to target and return control back to the event loop

That is the current MIGRATE behavior, but I think you are right that we shouldn't need it for clusters.

It would be nice to avoid adding configuration for this. It's better if it just works, especially if there is an idea of improving this later to make the config obsolete.

Good point. I added them mostly for debugging/testing purposes. I think I can remove some of them, if not all, next.

zuiderkwast commented 2 years ago

@PingXie I don't think huge keys causing OOM when migrated is a huge risk. If it is, then the memory-usage-per-slot feature would come in handy, so the memory can be predicted before migration.

The only problem your solution doesn't solve AFAICT is that it blocks both Redis nodes synchronously, the same way MIGRATE does. That could possibly be improved separately though, in some way...

The roadmap you presented above looks like a good way forward. It's probably much less work than what the "atomic slot migration" ideas would imply. It would be nice to get some kind of approval of this roadmap from the core team. (@madolson? btw I unassigned you by mistake, sorry)

PingXie commented 2 years ago

The only problem your solution doesn't solve AFAICT is that it blocks both Redis nodes synchronously, the same way MIGRATE does. That could possibly be improved separately though, in some way...

Agreed on migration being synchronous with my proposal. IMO, this impact will be felt on two ends of the key/value size spectrum. For extremely small key/value pairs (<100 bytes combined), the long pole comes from the overhead associated with reading the RESTORE response (due to the number of read syscalls). For extremely large key/value pairs (>10s of MBs), the long pole would be dominated by the write overhead. Based on my quick measurements using a loopback network setup, I am seeing about 2-10ms of impact (pause) with small key/value pairs (10-100 bytes). The impact of larger key/value pairs (~100KBs) is surprisingly low (<2ms), however. The impact then grows again as the key/value size increases further.

That could possibly be improved separately though, in some way...

Here is how I see we could remove the synchronous migration altogether in the future. The one and only key constraint that we need to live with is that the execution of the RESTORE and DELETE commands must not be interleaved with the customer workload for the keys involved. This proposal in its current form essentially takes the crudest form of synchronization: a global lock, achieved by running the migration logic on the same main event loop thread. However, it is completely possible to move this migration logic to its own thread and then rely on more granular locking at either the slot level or even the key level, which is a good thing to have in general (IMO). I believe it is a natural extension of this proposal and can be added to the roadmap (after completing the auto-resume for slot migration).

I don't think huge keys causing OOM when migrated is a huge risk. If it is, then the memory-usage-per-slot feature would come in handy, so the memory can be predicted before migration.

That is a good point @zuiderkwast. I think it can easily be incorporated once #10472 is completed - as long as we stay in the "synchronous" world. Depending on what synchronization granularity we land on when we move the migration logic to its own thread (for instance, slot-level vs key-level) we might or might not have an accurate reading, but I would also agree that we don't need 100% accuracy.

madolson commented 2 years ago

@zuiderkwast Oh lol, I thought you were just being a little passive-aggressive since I didn't do anything for 2 months :) (I feel like it was warranted; I've been poking people internally but haven't gotten anyone to work on it yet.)

I'm not strictly against what was outlined, but I still am not sure why we wouldn't just move towards the atomic slot migration. As far as I have reasonably seen, most people agree that the atomic approach is strictly better. It resolves the issue about failing part way through the migration as well as resolves the fact many clients don't correctly implement the asking. I agree that this solution builds towards that, but given that we have time to build a better final approach, why wouldn't we take the chance now to build it?

Some other general thoughts on the outline:

Admin executes CLUSTER MIGRATE SLOT <n> NODE <N>

Any reason not to support multiple slots instead of just one? The original code was implemented with the expectation that it would be orchestrated primarily externally. We've added other user friendly functions to make it easier to handle a lot of slots like CLUSTER ADDSLOTSRANGE
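For comparison, CLUSTER ADDSLOTSRANGE (a real command since Redis 7.0) takes start/end slot pairs, and a multi-slot CLUSTER MIGRATE could plausibly follow the same shape:

```
# Assign slots 0-5000 and 10001-16383 to the node this command runs against
redis-cli -p 7000 CLUSTER ADDSLOTSRANGE 0 5000 10001 16383
```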

On read-ready, send AUTH if present and return control back to the event loop

One of the real benefits of a background thread (or a fork) here is that we don't need so much event loop management. It can just synchronously do the operations. Not a major point though.

On read-ready and SETSLOT IMPORTING success, send SETSLOT MIGRATING to myself via a fake client and manually trigger the replication (see issue https://github.com/redis/redis/pull/10517)

I don't really get the fake client thing, why aren't we just setting the value and replicating it? Maybe fake client is just a wording choice.

On write-ready, start preparing RESTORE commands. There are a few things to consider:

a. We should batch enough key/value pairs so that we don't invoke the send syscall too often. A value of 64KB could be a good starting point but is also configurable

b. We shouldn’t accumulate too much data either because this step is performed as part of the main event loop so the more data processed the more impact on the customer workload. A value of 128KB could be a starting point but is also configurable

At AWS we usually use some type of latency threshold to determine the maximum amount, instead of a size, since latency is ultimately what we want to constrain. Having size values is usually less interesting.

c. If a key/value pair is too big (say bigger than 200MB), we have a choice to stop the migration to not risk OOM'ing Redis. Note that with the changes introduced by https://github.com/redis/redis/pull/10517, the slot in question will remain in a stable migrating state even after primary failures in either the source or the target

We found this to be highly problematic at AWS. The main problem is that most users don't know this ahead of time, and will just have an abrupt failure mode. Given that we don't roll back, there is a large risk here. This was our entire motivation to move towards a background-thread-based approach instead (and much later to have a fallback for fork).

d. The RESTORE commands are then sent synchronously to the target

So if the target is busy handling traffic we will block traffic on the primary? This should ideally be asynchronous in some capacity.

The slot migration record will be kept for 10 mins after completion and can be examined by a new cluster command CLUSTER REPORTMIGRATION.

Weird command, maybe we just log all of this information? At Amazon we have a "last slot migration status" that lasts forever, which we use in the case of a failure. I'm not sure we should introduce a net new command for this functionality.

PingXie commented 2 years ago

@zuiderkwast to complete my thoughts on the large data challenge, we need to consider a case where the target node might not have enough space to hold the data. It can end up rejecting the RESTORE command, eventually failing the migration request (after exhausting retries). This is not exactly the same scenario we have been discussing so far, where the source doesn't have enough memory to serialize the data. The point that I am trying to make here is that regardless of how hard we try, there will be cases when the cluster ends up in a perpetual migrating state and some hard decisions have to be made with the help of a human operator, for instance, deleting some keys, either from the same slot or a different slot. These are not things that can be completely automated IMO. Then there is also the corner case where the migrating/importing states are out of sync between replicas and the primary in the same shard (due to failures), since we don't have (and can't afford IMO) two-phase commit or distributed consensus when updating the slot ownership.

madolson commented 2 years ago

@PingXie All of the points you mentioned are basically why we want to have an atomic slot migration, so we can roll back and fail it. We can afford a 2PC between a source and target. There are also plans to introduce strong distributed consensus as part of a revamp of cluster mode; I've asked someone to share a document with you detailing that longer-term plan. (It's maybe a Redis 8 thing)

PingXie commented 2 years ago

@madolson I am completely speculating here but from a glimpse at the thread, what you described sounds like a typical replication scheme where the data is being funneled to the "replica", which is the migration node here, while the source continues to serve the customer traffic. When the target catches up close to the source, you pause the clients on the source, transfer the last portion of the data, fix up the ownership, and then release the paused clients? I am not sure if fork is needed at the design level though (genuine question). If this is even remotely close, it sounds like we will be building up memory footprint on both nodes during the entire migration process. Would this increase failure rates under heavy workload? Anyways, I am sure I missed/misunderstood things. Can you share some more details on your "atomic" proposal so we can evaluate the two in depth side by side?

Any reason not to support multiple slots instead of just one? The original code was implemented with the expectation that it would be orchestrated primarily externally. We've added other user friendly functions to make it easier to handle a lot of slots like CLUSTER ADDSLOTSRANGE

Sure. I am open to multiple slots support. This can be easily incorporated if there is interest from the community.

One of the real benefits of a background thread (or a fork) here is that we don't need so much event loop management.

Yep. I shared some thoughts earlier at https://github.com/redis/redis/issues/2807#issuecomment-1145417460

I don't really get the fake client thing, why aren't we just setting the value and replicating it?

It is the same fake client used by AOF loading. It is an implementation choice and I guess we could do everything manually too. I am fine either way.

At AWS we usually use some type of latency threshold to determine the maximum amount, instead of a size, since latency is ultimately what we want to constrain. Having size values is usually less interesting.

Agreed that direct measurement of latency is better but I was under the impression that Redis is not using any high-resolution clock?

We found this to be highly problematic at AWS. The main problem is that most users don't know this ahead of time, and will just have an abrupt failure mode. Given that we don't roll back, there is a large risk here. This was our entire motivation to move towards a background-thread-based approach instead (and much later to have a fallback for fork).

Thanks for the context. With the reliability fix, I am hoping that the cluster will be able to maintain this migrating state (for the slots in question) perpetually in the face of future failovers. The ASKING redirection is indeed not a very good experience, but I no longer think the cluster will easily fall apart due to the interruption, again with the reliability fix.

So if the target is busy handling traffic we will block traffic on the primary? This should ideally be asynchronous in some capacity.

I think there is a path toward async - again see my comments at https://github.com/redis/redis/issues/2807#issuecomment-1145417460

Weird command, maybe we just log all of this information? At Amazon we have a "last slot migration status" that lasts forever, which we use in the case of a failure.

Logging is a given, but a command will be useful when operators don't have direct or real-time access to the logs. BTW, I just updated the logic to keep the last few completed requests (no longer time-based). Curious - how does one find out the "last slot migration status" on AWS? Is it dumped to a log or part of some command output? INFO, perhaps?

It resolves the issue about failing part way through the migration as well as resolves the fact many clients don't correctly implement the asking.

I wonder if it makes sense to have the source node proxy this kind of request instead of failing with ASKING?

PingXie commented 2 years ago

@madolson looks like our messages crossed. Looking forward to the doc!

madolson commented 2 years ago

Agreed that direct measurement of latency is better but I was under the impression that Redis is not using any high-resolution clock?

The monotonic clock is typically used for this purpose.

It resolves the issue about failing part way through the migration as well as resolves the fact many clients don't correctly implement the asking.

I've seen this as just a general ask for cluster mode: instead of throwing MOVED/redirect errors, just handle the request. That could be a good strategy for the time being.

I am not sure if fork is needed at the design level though (genuine question).

You're very right that we don't need a fork; I've proposed it since I think it's simpler than doing all the work in the main thread, or locking and doing stuff on a background thread, given some of the issues identified. I think the correct long-term approach is key locking and doing work on other threads, but we are missing some foundational pieces that someone from AWS should hopefully open a PR for soon. (The issues are identified here: https://github.com/redis/redis/issues/7372#issuecomment-640956204). If people think it makes more sense we can do the 100% main-thread approach too; I think both designs build towards the same long-term solution.

madolson commented 2 years ago

If this is even remotely close, it sounds like we will be building up memory footprint on both nodes during the entire migration process. Would this increase failure rates under heavy workload? Anyways, I am sure I missed/misunderstood things.

We can also build up the footprint on the replica. One choice that was made was to buffer the replication traffic on the source, but it can be buffered on the target instead. When the target has received the "restore" for a key, it can then start applying all of the buffered commands for that key. Generally, since slots are all much smaller than the full dataset, this doesn't impose too much of a memory penalty.

In the main thread approaches, either with or without blocking, we know which keys have been sent, and we can inline operations between the outbound restore commands for keys that have been sent. For keys that have yet to be sent, we don't need to accumulate replication traffic since we will send the latest image of the key.

Can you share some more details on your "atomic" proposal so we can evaluate the two in depth side by side?

I'll see if I can get something in place by early next week. We have it all implemented internally, but it's not in a form that is easy to share.

PingXie commented 2 years ago

@madolson taking a step back here and looking at the high level picture, these two proposals represent two fundamentally different mindsets and differ from each other in just one core area: determinism.

My proposal is an evolution of the existing slot migration protocol, which is inherently non-deterministic. I believe my work in this area would lead to significant improvements in reliability, performance, and usability when it comes to scaling out the cluster. That said, despite my best efforts, I agree that I can never say this solution works 100% reliably, at the design level.

Your/AWS proposal on the other hand is going after the core limitation of the existing slot migration protocol. With 2PC/distributed consensus, I am sure your solution will be deterministic, at the design level.

Other issues like multi-slot support, sync vs async, reporting mechanism, etc. are all tangential IMO, and I am sure either approach would have no problem achieving them with a reasonable amount of engineering effort.

Now with

  1. You/AWS looking at Redis 8.0 (earliest I guess?) and me shooting for 7.2
  2. Existing slot migration protocol not working well nor going anywhere
  3. No coupling of any kind between the two proposals

I actually don't see us making a binary decision.

So would it make sense to explore/support both of them in parallel? Curious to hear if this is what the community is interested in.

PingXie commented 2 years ago

That said, despite my best efforts, I agree that I can never say this solution works 100% reliably, at the design level

I wanted to clarify that there is an unspoken assumption behind the statement above, which is that we don't change how we manage the migrating states today (other than replicating them to replicas asynchronously). If we introduced 2PC and managed the update of migrating states (migrating/importing/slot-ownership) as a distributed transaction as well, I think we could solve the (lack of) determinism issue. It is just that my proposal would no longer be considered an extension of the existing slot migration protocol at that point.

madolson commented 2 years ago

You/AWS looking at Redis 8.0 (earliest I guess?) and me shooting for 7.2

We're taking a fresh look at cluster mode for Redis 8, but I'm still hoping to get the atomic slot migration in 7.2.

Existing slot migration protocol not working well nor going anywhere

Honestly, I think this is one of the reasons a lot of people run cluster mode on managed services, since it's such a pain to self-manage outside of very small configurations. There is some tooling in the CLI to make it less painful.

No coupling of any kind between the two proposals

This is a fair point. The only real downside I see is that we would have two mechanisms for building slot migration into the engine that do basically the same thing. Historically we've used the redis-cli as the utility for helping to facilitate migrations. My personal preference is that we should fix the known issues with slot migration around setting slot state. This should make migrations with redis-cli demonstrably safer. In parallel we can build a real atomic slot migration that doesn't rely on the existing mechanisms.

PingXie commented 2 years ago

My personal preference is that we should fix the known issues with slot migration around setting slot state. This should make migrations with redis-cli demonstrably safer. In parallel we can build a real atomic slot migration that doesn't rely on the existing mechanisms.

So to map your comments to the below roadmap of mine (evolution of the existing slot migration protocol), I believe you are saying that you agree with the engine work all the way to step 3 but your preference is atomic slot migration over step 4, 5 and 6?

  1. Add the shard-id support (#10474)
  2. Improve the current slot migration reliability (#10517)
  3. Add support for failover in empty shards. (To be filed/submitted)
  4. Implement basic single-shot cluster migration command (#2807)
  5. Automatically resume any slot migration interrupted by failover (Proposed)
  6. Async migration (off of the main thread) (Proposed)

madolson commented 2 years ago

@PingXie I generally agree with that.

madolson commented 2 years ago

Didn't get the doc done this week as I promised, but it's almost done; early next week!

hwware commented 2 years ago

I am not sure if it could support SETSLOT on a range instead of only accepting one slot so far. The possible format would be: CLUSTER SETSLOT startslot endslot IMPORTING node-id | MIGRATING node-id | NODE node-id | STABLE

Thanks

zuiderkwast commented 2 years ago

The doc mentioned above is in #10875, preview here: https://github.com/redis/redis/blob/76283c116d91fc3c15e42bbedbcceb8d2ee51938/CLUSTER.md

It appears that atomic slot migration is orthogonal to the new cluster protocol:

It is expected that we will have atomic slot migration (based on fork+serialize like Redis replication) in Redis before Flotilla is GA - in which case Flotilla will only support that and not the existing key migration flow.

Doesn't this mean that we should start implementing it ASAP? I think we can divide it into multiple smaller steps. How about starting with separate things like implementing a CLUSTER PURGESLOT command? (Non-atomically delete/unlink all keys in a slot not owned by the node.)

Btw, forking for one slot migration has a serious CoW impact, while locking/serializing one key at a time in the main thread is acceptable IMO, so I think the threaded approach, where the migration thread requests one key at a time from the main thread over a pipe, etc., is preferable and also not that complex.

PingXie commented 2 years ago

It appears that atomic slot migration is orthogonal to the new cluster protocol

I also agree these are two orthogonal decisions. In fact, after reviewing #10875, I am starting to think that slot migration is not an integral part of the cluster V2 design, which in theory can work with either solution, atomic or non-atomic. If we can agree on the high-level command format (proposal: CLUSTER MIGRATE SLOT <slot> [NODE <node_id> | SHARD <shard_id>] [USER <user>] [PASSWORD <password>]), the solutions are largely interchangeable.
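A hypothetical invocation under that proposed syntax (not an existing Redis command; all values are placeholders) might look like:

```
# Migrate slot 1234 to the shard identified by <shard-id>, authenticating against
# the target with a dedicated migration user (proposed syntax only)
CLUSTER MIGRATE SLOT 1234 SHARD <shard-id> USER migration-user PASSWORD <password>
```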

It also helps reduce the risk if we can avoid taking on too many big bets.

I will share my implementation after I merge the dependent PRs (#10474, #10517).

I am not sure if it could support SETSLOT on a range instead of only accepting one slot so far.

This is actually more of a packaging decision than something that affects the core design (for either solution). I am open to community suggestions. If there is a compelling use case, happy to provide the semantics. Otherwise, my preference is to keep the interface simple.

madolson commented 2 years ago

Here is the more detailed design being proposed for atomic slot migration: https://github.com/redis/redis/issues/10933