
[NEW] WAIT ALL #743

Open zuiderkwast opened 3 months ago

zuiderkwast commented 3 months ago

I want to revive the feature suggested in https://github.com/redis/redis/pull/8684.

WAIT ALL in a cluster would provide real consistency. Without waiting for all replicas, there's a risk that the one replica that didn't replicate all writes will win the failover election.

A problem with the mentioned PR is that WAIT ALL would succeed even if all replicas are gone, e.g. during a netsplit. In a cluster, we can avoid this problem by looking at the cluster structures rather than server.replicas: a replica that's gone missing is still part of the cluster until an admin calls CLUSTER FORGET.
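To make the idea concrete, here's a minimal self-contained sketch (the struct and function names are invented for illustration, not the actual Valkey internals): the expected replica count comes from the cluster topology, so a replica lost in a netsplit still counts against WAIT ALL until CLUSTER FORGET removes it.

```c
#include <stdio.h>
#include <stdbool.h>

/* Invented model of a node in the cluster view. */
typedef struct ClusterNode {
    bool is_replica_of_me;   /* a replica of this primary in the topology */
    bool connected;          /* link currently alive */
    long long acked_offset;  /* last replication offset it ACKed */
} ClusterNode;

/* Expected replicas come from the topology: a node that went missing still
 * counts until the admin removes it with CLUSTER FORGET. */
static int expected_replicas(const ClusterNode *nodes, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (nodes[i].is_replica_of_me) count++;
    return count;
}

/* WAIT ALL is satisfied only when every replica in the topology has ACKed
 * the target offset, so it cannot succeed spuriously during a netsplit. */
static bool wait_all_satisfied(const ClusterNode *nodes, int n, long long target) {
    int acked = 0;
    for (int i = 0; i < n; i++)
        if (nodes[i].is_replica_of_me && nodes[i].connected &&
            nodes[i].acked_offset >= target) acked++;
    return acked == expected_replicas(nodes, n);
}

int main(void) {
    ClusterNode nodes[] = {
        {true, true, 100},   /* replica, fully caught up */
        {true, false, 0},    /* replica lost in a netsplit: still expected */
    };
    printf("satisfied: %d\n", wait_all_satisfied(nodes, 2, 100)); /* 0 */
    return 0;
}
```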

Can we do something similar for Sentinel? I'm not familiar enough with Sentinel. Perhaps it knows, and can tell the primary, how many replicas it's supposed to have(?).

I care less about standalone primary-replica setups since they don't have the same guarantees anyway. I'd be fine with just counting server.replicas in this case (though delaying the decrement of the replica count when a replica disconnects would mitigate the problem a bit, perhaps). A manual config for the expected number of replicas would also be acceptable.

enjoy-binbin commented 3 months ago

Agree with this, I want to do that. Another thing I want to deal with: if a client passes a large numreplicas, can we implicitly treat it as WAIT ALL? (Or, to be honest, I think it could also return an error, since it will obviously just time out; I don't think waiting for a replica that will only be online in the future is a normal user request.)
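To illustrate both options, a rough sketch (the function and constant names are hypothetical, not an actual Valkey API): an oversized numreplicas is either clamped to the replica count known from the topology, or rejected up front instead of blocking until the timeout.

```c
/* Hypothetical pre-check for WAIT; expected is the replica count derived
 * from the cluster topology, as in the proposal above. Illustrative only. */
#include <stdio.h>

#define WAIT_ERR_TOO_MANY -1

/* Returns the effective numreplicas target, or an error when the request
 * can only time out because fewer replicas exist than requested. */
static int effective_numreplicas(int requested, int expected, int treat_as_all) {
    if (requested <= expected) return requested;
    if (treat_as_all) return expected;  /* Option A: treat as WAIT ALL */
    return WAIT_ERR_TOO_MANY;           /* Option B: fail fast with an error */
}

int main(void) {
    printf("%d\n", effective_numreplicas(5, 2, 1)); /* 2: clamped to ALL  */
    printf("%d\n", effective_numreplicas(5, 2, 0)); /* -1: immediate error */
    return 0;
}
```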

I have encountered such a problem before: Redisson sent a WAIT numreplicas timeout command where numreplicas was much larger than the actual number of replicas, causing the client to block until the timeout (and return an error).

There are several reasons why the value can be much larger:

  1. There are problems with the client code, such as https://github.com/redisson/redisson/issues/1304, which counted replicas multiple times.
  2. Or a proxy implementation returns non-standard CLUSTER SLOTS output.

I remember we had such a problem with Sentinel: the failover it triggered would not pause the primary, causing the new replica to lose some data during the failover. I've wanted to deal with this before, but I haven't found the time yet.

ranshid commented 3 months ago

I like this proposal very much as it greatly simplifies existing scenarios for clients.

I think, though, that it should be made very clear what the guarantee of this new API is. For example, what happens in the case of a syncing replica? A replica that is waiting for BGSAVE or consuming the RDB will not receive a GETACK request from the primary, so it will not report back to the server and WAIT will time out, right? Maybe it should be made clear that we will only wait for the "ACTIVE" replicas, i.e. those in steady-state replication or a psync?
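As a rough illustration of that reading (the state names below are invented for the example, not Valkey's actual enum), only replicas in the steady-state streaming phase would count toward WAIT ALL:

```c
#include <stdio.h>

/* Invented states modeling the sync phases mentioned above. */
typedef enum {
    REPL_WAIT_BGSAVE,   /* waiting for the primary's BGSAVE to start/finish */
    REPL_RDB_TRANSFER,  /* consuming the RDB, cannot answer a GETACK */
    REPL_ONLINE         /* steady-state streaming, can ACK offsets */
} ReplState;

/* Only "ACTIVE" (online) replicas would participate in WAIT ALL. */
static int active_replicas(const ReplState *states, int n) {
    int active = 0;
    for (int i = 0; i < n; i++)
        if (states[i] == REPL_ONLINE) active++;
    return active;
}

int main(void) {
    ReplState states[] = {REPL_ONLINE, REPL_RDB_TRANSFER, REPL_WAIT_BGSAVE};
    printf("active: %d\n", active_replicas(states, 3)); /* 1 of 3 counts */
    return 0;
}
```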

ps. I also noticed that there was some proposal to have a strict configurable value of "num of replicas to sync". We should be able to handle different cluster-upgrade scenarios, where additional replicas are sometimes added in order to fail over to them later.

enjoy-binbin commented 3 months ago

> Maybe it should be made clear that we will only wait for the "ACTIVE" replicas, i.e. those in steady-state replication or a psync?

Yeah, I think we should aim for "ACTIVE" replicas. But it is also hard to say: if a replica happens to be disconnected due to the network, does WAIT need to wait until it comes online?

> We should be able to handle different cluster-upgrade scenarios, where additional replicas are sometimes added in order to fail over to them later.

So when we remove a replica, do we change the "value" of majority / all? Or we could dynamically calculate the current majority / all value each time, instead of relying on the value calculated when the WAIT command was first received.
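A minimal sketch of the dynamic option (illustrative names only): the target is re-read from the live topology on every replica ACK instead of being captured once when the WAIT command arrived, so blocked clients pick up topology changes mid-wait.

```c
#include <stdio.h>
#include <stdbool.h>

static int topology_replica_count = 3; /* stands in for the live cluster view */

/* Re-evaluated on every replica ACK: removing a replica (CLUSTER FORGET) or
 * adding one during an upgrade changes the goal of already-blocked clients. */
static bool wait_all_done(int acked_replicas) {
    return acked_replicas >= topology_replica_count;
}

int main(void) {
    printf("%d\n", wait_all_done(2)); /* 0: still waiting for the 3rd ACK */
    topology_replica_count = 2;       /* a replica was removed mid-wait */
    printf("%d\n", wait_all_done(2)); /* 1: unblocks against the new target */
    return 0;
}
```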

zuiderkwast commented 2 months ago

> if a replica happens to be disconnected due to the network, does WAIT need to wait until it comes online?

Yes, it does. Otherwise it's just best effort. What I'm interested in is a real guarantee. It would make Valkey cluster a database with real consistency. Then you can store important data, not just cache and session data.