ag-TJNII opened 4 months ago
I like this idea! I've seen issues in the past, especially with cluster mode, where a node ends up being marked as a primary with no data. It seems like it should likely be part of the handshake that the replica has data and expects this primary to still have data unless it sees the same replication ID (indicating the primary did an intentional flushall of its own data).
We will need to document the behavior in such a way that an admin knows that the cluster might not self-recover and may require operator intervention.
For the real-world cases I've been testing where the nodes come up empty I'd expect the new node to be promoted to the primary (either via Sentinel or via clustering), and the old node that came up empty should be reconfigured to be a replica and begin restoring the dataset from another node. In my testing this happens as long as I prevent the original replica(s) from clearing their data out via bad replication before the failover happens. I'd think when replication is reconfigured to change which nodes are primaries and which are replicas that would flush out the old primary state data that could wedge up the cluster, but perhaps there's detail and nuance here you know of that I don't.
It's really just that an operator could have caused the flush. I think we might be able to detect this case as well. In either case, I'm going to put this on the roadmap.
This option will always break the replication and require admin intervention, and until then the setup will be running in non-HA mode.
The core problem is that the primary didn't stay "down" long enough to trigger the election reliably. I think a correct solution should explore the following instead:
In the cluster case, 1) can be achieved via nodes.conf, and cluster_node_timeout can be used for 2). My knowledge of Sentinel is limited, but I'd guess Sentinel would need to be responsible for both 1) and 2).
> The core problem is that the primary didn't stay "down" long enough to trigger the election reliably.
This is an unsolvable problem since we prioritize availability over consistency, and your proposal is only a partial solution over a subset of the problems, and it introduces new race conditions and more failure modes for operators to worry about. It's always possible that the node with data becomes apparently unavailable, so we prioritize electing an available node even if it has no data. This proposal shifts somewhat away from availability toward consistency in CAP terms.
> This is an unsolvable problem since we prioritize availability over consistency,
If we apply the argument of "prioritizing availability over consistency" to this situation, we shouldn't take this proposal but instead let the data flush happen. On the contrary, what is proposed here muddies that stance even further. Users get some availability, but not at the level they would expect (HA is lost). They also have no easy way to recover the data from the replica, since it now runs on a different replid. This proposal, if implemented, will always require admin intervention to fix the now-stranded replicas.
I understand that the project was built with a bias towards availability, but I think it has gone both too far and is applied inconsistently. A bit of a generalization (since we are discussing replication here), but this practice in general is the reason why the cluster is really hard to manage and why we are still fixing fundamental bugs in it.
I just realized the case I was mentioning only applies to how AWS does it, so that's just a problem that we have. I think Ping is right that you just need to better protect against nodes coming back empty.
We prioritize availability over consistency to some extent, and I think that's fine as long as the impact is limited. It's fine that we use asynchronous replication and that we can lose a few writes when a primary crashes. WAIT-ing for all replicas can prevent the write from being lost in that case.
It's not OK though to lose the whole data set. I want the cluster to prevent this from happening without manual intervention. Ping's ideas make sense to me.
If I understand this correctly, the primary needs to lose its data while keeping its nodes.conf so it still has its place in the cluster. If so, I can't see why we couldn't store some more info in nodes.conf to prevent this from happening.
Another possibility: If the replicas can detect that the primary came back empty-handed, they could escalate/gossip this to the cluster in some way to mark the node as FAIL and trigger a failover.
I think in standalone mode we already solve this problem through the `master-reboot-down-after-period` parameter in sentinel.conf.
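For reference, the directive looks like this in sentinel.conf (the `mymaster` name and the 30000 ms value are just examples):

```
# If the monitored primary reports that it has rebooted, treat it as
# down for this many milliseconds, giving Sentinel a chance to fail
# over before replicas resync from the (possibly empty) restarted
# primary. 0 disables the behavior.
sentinel master-reboot-down-after-period mymaster 30000
```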
The steps:
In real production, the gap between step 3 and step 4 cannot be too long, otherwise users will not accept it. In fact, the gap between steps 3 and 4 is the reboot time, so the `master-reboot-down-after-period` timing should cover it: the rebooted primary is treated as down for that window, so a replica can be failed over before resyncing.
@hwware I don't understand how master-reboot-down-after-period can solve this in a Sentinel setup. If the primary has lost its data, it comes up fast. If it still has its data, it takes more time to load its data and become available. It's the first case we want to prevent here.
In step 4, when node 1 comes back without its db, the primary node has a "reboot" signal. If the primary node has no "reboot" signal, it means one of 2 cases:
Maybe the master-reboot-down-after-period parameter cannot solve all the problems in the top comment, but at least it could solve a subset of them.
Hi, just an idea here. I think the solution could be a more general one. The server config could have a `check-db-exists yes/no` option and refuse to start if the RDB file is missing.
No, this is the opposite of the intent of the request. I do not want to have to manually set flags or toggle settings after bootstrapping. Furthermore, it wouldn't work at all in the case without persistent storage documented in the original ask, as the nodes always come up without an RDB file.
The ask is actually quite simple: if a follower was replicating a dataset and the primary comes back empty, don't continue replication. As replication tracks offsets, I believe this could be something as simple as:
```c
/* Pseudocode; the config flag name is hypothetical. */
if (replication_disable_replicating_behind_primary &&
    remote_offset < local_offset) {
    log("Refusing to replicate from lagging primary");
    return;
}
```
> This option will always break the replication and require admin intervention, and until then the setup will be running in non-HA mode. The core problem is that the primary didn't stay "down" long enough to trigger the election reliably.
Sentinel will trigger an election in the case I've described. The problem is that by the time Sentinel has triggered the election, the replicas have replicated the empty dataset and we've lost the data. Relying on timing is hacky at best; restart timing shouldn't be a factor in whether the dataset gets dropped. As for the argument that availability is favored over consistency: having my DB dropped when the primary restarts defeats the purpose of HA. At that point I should just run a single node, as I'm going to lose all the data in either case. This is the wrong side of the availability/consistency balance; there are plenty of cases where a minute or two of downtime is better than losing all the data.
So for my real-world problem case, I think the simple blocking config flag I mentioned above will fix it. Making it a config flag should allow it to be turned on/off where needed to avoid any side effects. I'll see about making a proof-of-concept branch; pointers to the relevant code would be appreciated as I've never worked in this codebase before.
The Problem
When running replication in environments without persistent storage it's possible for the following to occur:
To reproduce
Expected behavior
The replica should have detected that:
Additional information
This is documented at https://valkey.io/topics/replication/ under "Safety of replication when master has persistence turned off", which is why I'm filing this as a feature, not a bug. I'm assuming it works this way for gestures vaguely reasons, but it seems like adding an option to disable replication if the primary's ID changes or if its offset goes backwards would be trivial, and the current behavior is dangerous.
Aside: that page says "For example the master can restart fast enough for Sentinel to not detect a failure, so that the failure mode described above happens." That's a nice way of saying "this implementation has a glaring race condition and flaw"... If the reliability of a replication process is reliant on instances not rebooting quickly, that's a pretty huge sign something is wrong, but I digress...
The only workaround I'm aware of for this is to enable ACLs for replication, and configure the replication user to start disabled. This requires another process to assert that the primary is actually the primary and enable the user.
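A rough sketch of that workaround (the user name and password are placeholders; the ACL permissions follow the documented minimum for a replication user):

```
# On the primary: a replication user that starts disabled ("off").
# An external process enables it only after confirming this node
# really is the intended primary:
#   ACL SETUSER replica-user on
acl setuser replica-user off >somepassword +psync +replconf +ping

# On each replica (in its config file), authenticate as that user:
masteruser replica-user
masterauth somepassword
```

Until the external process flips the user to `on`, replicas cannot authenticate, so they cannot resync an empty dataset from a freshly restarted primary.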
Please either break this behavior and make it fail safe, or add an option to fail safe. Thanks.