valkey-io / valkey

A flexible distributed key-value datastore that supports both caching and beyond caching workloads.
https://valkey.io

[Feature] Add option to prevent replica from replicating an empty DB if the primary comes back empty #579

Open ag-TJNII opened 4 months ago

ag-TJNII commented 4 months ago

The Problem

When running replication in environments without persistent storage it's possible for the following to occur:

  1. Node 1 is the primary
  2. Node 2 is replicating node 1
  3. Node 1 drops
  4. Node 1 comes back without its DB, in an empty state
  5. Node 2 reconnects and performs a full resync
  6. Node 2 is now empty, having propagated the data loss

To reproduce

# Start a network namespace
docker run -d --name=valkey-ns registry.k8s.io/pause

# Start server 1
docker run -d --name=valkey-1 --rm -ti --net=container:valkey-ns valkey/valkey:7-alpine

# Start server 2
docker run -d --name=valkey-2 --rm -ti --net=container:valkey-ns valkey/valkey:7-alpine valkey-server --port 6380 --replicaof 127.0.0.1 6379

# Prove we're online and replicating
docker exec -ti valkey-1 valkey-cli get foo
(nil)
docker exec -ti valkey-2 valkey-cli get foo
(nil)
docker exec -ti valkey-1 valkey-cli set foo bar
OK
docker exec -ti valkey-2 valkey-cli get foo
"bar"
docker exec -ti valkey-2 valkey-cli info replication
# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=6380,state=online,offset=96,lag=0
master_failover_state:no-failover
master_replid:0cadcd122bc5a647f36a0e1a27f75c4bbbd94068
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:96
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:96

# Completely remove and restart 1
docker rm -f valkey-1
docker run -d --name=valkey-1 --rm -ti --net=container:valkey-ns valkey/valkey:7-alpine

# Read test key out of 2
docker exec -ti valkey-2 valkey-cli info replication
# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=6380,state=wait_bgsave,offset=0,lag=0
master_failover_state:no-failover
master_replid:ad8c13246bf60c259f189c7a9db2e23f86cdfe17
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:0

docker exec -ti valkey-2 valkey-cli get foo
(nil)

# Clean up
docker rm -f valkey-ns valkey-2 valkey-1
$ docker logs valkey-2
1:C 30 May 2024 22:21:26.898 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:C 30 May 2024 22:21:26.898 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
1:C 30 May 2024 22:21:26.898 * Valkey version=7.2.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 30 May 2024 22:21:26.898 * Configuration loaded
1:S 30 May 2024 22:21:26.899 * monotonic clock: POSIX clock_gettime
                .+^+.
            .+#########+.
        .+########+########+.           Valkey 7.2.5 (00000000/0) 64 bit
    .+########+'     '+########+.
 .########+'     .+.     '+########.    Running in standalone mode
 |####+'     .+#######+.     '+####|    Port: 6380
 |###|   .+###############+.   |###|    PID: 1
 |###|   |#####*'' ''*#####|   |###|
 |###|   |####'  .-.  '####|   |###|
 |###|   |###(  (@@@)  )###|   |###|          https://valkey.io
 |###|   |####.  '-'  .####|   |###|
 |###|   |#####*.   .*#####|   |###|
 |###|   '+#####|   |#####+'   |###|
 |####+.     +##|   |#+'     .+####|
 '#######+   |##|        .+########'
    '+###|   |##|    .+########+'
        '|   |####+########+'
             +#########+'
                '+v+'

1:S 30 May 2024 22:21:26.901 * Server initialized 
1:S 30 May 2024 22:21:26.901 * Ready to accept connections tcp
1:S 30 May 2024 22:21:26.901 * Connecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:21:26.901 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:21:26.901 * Non blocking connect for SYNC fired the event.
1:S 30 May 2024 22:21:26.901 * Master replied to PING, replication can continue...
1:S 30 May 2024 22:21:26.901 * Partial resynchronization not possible (no cached master)
1:S 30 May 2024 22:21:31.410 * Full resync from master: efef6f51100556f7d72c4bfa1c5e47afac364281:14
1:S 30 May 2024 22:21:31.411 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
1:S 30 May 2024 22:21:31.411 * MASTER <-> REPLICA sync: Flushing old data
1:S 30 May 2024 22:21:31.411 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 30 May 2024 22:21:31.413 * Loading RDB produced by valkey version 7.2.5
1:S 30 May 2024 22:21:31.413 * RDB age 0 seconds  
1:S 30 May 2024 22:21:31.413 * RDB memory usage when created 0.92 Mb
1:S 30 May 2024 22:21:31.413 * Done loading RDB, keys loaded: 0, keys expired: 0.
1:S 30 May 2024 22:21:31.413 * MASTER <-> REPLICA sync: Finished with success
1:S 30 May 2024 22:21:54.104 * Connection with master lost.
1:S 30 May 2024 22:21:54.104 * Caching the disconnected master state.
1:S 30 May 2024 22:21:54.104 * Reconnecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:21:54.104 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:21:54.104 # Error condition on socket for SYNC: Connection refused
1:S 30 May 2024 22:21:55.067 * Connecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:21:55.067 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:21:55.067 # Error condition on socket for SYNC: Connection refused
1:S 30 May 2024 22:21:56.075 * Connecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:21:56.075 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:21:56.075 # Error condition on socket for SYNC: Connection refused
1:S 30 May 2024 22:21:57.082 * Connecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:21:57.083 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:21:57.083 # Error condition on socket for SYNC: Connection refused
1:S 30 May 2024 22:21:58.089 * Connecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:21:58.089 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:21:58.089 # Error condition on socket for SYNC: Connection refused
1:S 30 May 2024 22:21:59.097 * Connecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:21:59.097 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:21:59.097 # Error condition on socket for SYNC: Connection refused
1:S 30 May 2024 22:22:00.101 * Connecting to MASTER 127.0.0.1:6379
1:S 30 May 2024 22:22:00.102 * MASTER <-> REPLICA sync started
1:S 30 May 2024 22:22:00.102 * Non blocking connect for SYNC fired the event.
1:S 30 May 2024 22:22:00.102 * Master replied to PING, replication can continue...
1:S 30 May 2024 22:22:00.103 * Trying a partial resynchronization (request efef6f51100556f7d72c4bfa1c5e47afac364281:97).
1:S 30 May 2024 22:22:05.486 * Full resync from master: 82ab8fdec8d14314831607c8f0d2d0d13d457c91:0
1:S 30 May 2024 22:22:05.487 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
1:S 30 May 2024 22:22:05.487 * Discarding previously cached master state.
1:S 30 May 2024 22:22:05.487 * MASTER <-> REPLICA sync: Flushing old data
1:S 30 May 2024 22:22:05.487 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 30 May 2024 22:22:05.490 * Loading RDB produced by valkey version 7.2.5
1:S 30 May 2024 22:22:05.490 * RDB age 0 seconds
1:S 30 May 2024 22:22:05.490 * RDB memory usage when created 0.92 Mb
1:S 30 May 2024 22:22:05.490 * Done loading RDB, keys loaded: 0, keys expired: 0.
1:S 30 May 2024 22:22:05.490 * MASTER <-> REPLICA sync: Finished with success

Expected behavior

The replica should have detected that the primary's replication ID changed and its offset went backwards, and refused to resync rather than flushing its own copy of the dataset.

Additional information

This is documented at https://valkey.io/topics/replication/ under "Safety of replication when master has persistence turned off", which is why I'm filing this as a feature, not a bug. I'm assuming it works this way for *gestures vaguely* reasons, but it seems like adding an option to disable replication if the primary's ID changes or if their offset goes backwards would be trivial and that the current behavior is dangerous.

Aside, that page says "For example the master can restart fast enough for Sentinel to not detect a failure, so that the failure mode described above happens." That's a nice way of saying "this implementation has a glaring race condition and flaw"... If the reliability of a replication process is reliant on instances not rebooting quickly that's a pretty huge sign something is wrong, but I digress...

The only workaround I'm aware of for this is to enable ACLs for replication, and configure the replication user to start disabled. This requires another process to assert that the primary is actually the primary and enable the user.
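
Roughly, that workaround looks something like the following sketch (the user name, password, and exact command list here are just placeholders, not a recommendation):

# On the primary: create a replication-only user that starts disabled
valkey-cli ACL SETUSER repl-user off '>repl-pass' +psync +replconf +ping

# On each replica, authenticate replication as that user (valkey.conf):
#   masteruser repl-user
#   masterauth repl-pass

# An external process asserts the primary really is the primary and still has
# the expected dataset, then enables the user so replicas can sync:
valkey-cli ACL SETUSER repl-user on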

Please either break this behavior and make it fail safe, or add an option to fail safe. Thanks.

madolson commented 4 months ago

I like this idea! I've seen issues in the past, especially with cluster mode, where a node ends up being marked as a primary with no data. It seems like it should likely be part of the handshake that the replica indicates it has data and expects this primary to still have data, unless it sees the same replication ID (indicating the primary did an intentional flushall of its own data).

We will need to document the behavior in such a way that an admin knows that the cluster might not self-recover and may require operator intervention.

ag-TJNII commented 4 months ago

We will need to document the behavior in such a way that an admin knows that the cluster might not self-recover and may require operator intervention.

For the real-world cases I've been testing where the nodes come up empty, I'd expect a new primary to be promoted (either via Sentinel or via clustering), and the old node that came up empty to be reconfigured as a replica and begin restoring the dataset from another node. In my testing this happens as long as I prevent the original replica(s) from clearing their data out via bad replication before the failover happens. I'd think when replication is reconfigured to change which nodes are primaries and which are replicas that would flush out the old primary state data that could wedge up the cluster, but perhaps there's detail and nuance here you know of that I don't.

madolson commented 3 months ago

I'd think when replication is reconfigured to change which nodes are primaries and which are replicas that would flush out the old primary state data that could wedge up the cluster, but perhaps there's detail and nuance here you know of that I don't.

It's really just that an operator could have caused the flush. I think we might be able to detect this case as well. In either case, I'm going to throw this in the roadmap.

PingXie commented 3 months ago

This option will always break the replication and require admin intervention, and until then the setup will be running in non-HA mode.

The core problem is that the primary didn't stay "down" long enough to trigger the election reliably. I think a correct solution should explore the following instead:

  1. A primary node needs to keep track of the number of replicas it has across restarts
  2. A primary node needs to be able to get back to business immediately after a (re)start if it has no replicas, but otherwise needs to wait long enough to ensure an election happens.

In the cluster case, 1) can be achieved via nodes.conf, and cluster_node_timeout can be used for 2). My knowledge about Sentinel is limited, but I guess Sentinel would need to be responsible for both 1) and 2).
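
For reference, the existing knobs I mean look roughly like this in a cluster-mode valkey.conf (the values are just examples):

# valkey.conf (cluster mode)
cluster-enabled yes
cluster-config-file nodes.conf   # cluster state kept across restarts
cluster-node-timeout 15000       # milliseconds, drives failure detection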

madolson commented 3 months ago

The core problem is that the primary didn't stay "down" long enough to trigger the election reliably.

This is an unsolvable problem since we prioritize availability over consistency, and your proposal is only a partial solution to a subset of the problems while introducing new race conditions and more failure modes for operators to worry about. It's always possible the node with data becomes apparently unavailable, so we prioritize electing a node without data. This is somewhat shifting away from availability towards consistency in CAP terms.

PingXie commented 3 months ago

This is an unsolvable problem since we prioritize availability over consistency,

If we apply the argument of "prioritizing availability over consistency" to this situation, we shouldn't take this proposal but instead let the data flush happen. On the contrary, what is proposed here muddies that stance even further. Users have some availability, but not at the same level as they would expect (HA is lost). They also have no easy way to recover the data in the replica, since it runs on a different replid. This proposal, if implemented, will always require admin intervention to fix the now-stranded replicas.

I understand that the project was built with a bias towards availability, but I think it has gone too far and is also applied inconsistently. This is a bit of a generalization (since we are discussing replication here), but this practice in general is the reason why the cluster is really hard to manage and why we are still fixing fundamental bugs in it.

madolson commented 3 months ago

[Deleted]

I just realized the case I was mentioning only applies to how AWS does it, so that's just a problem that we have. I think Ping is right that you just need to better protect nodes that are coming back empty.

zuiderkwast commented 3 months ago

We prioritize availability over consistency to some extent, and I think that's fine as long as the impact is limited. It's fine that we use asynchronous replication and that we can lose a few writes when a primary crashes. WAIT-ing for all replicas can prevent the write from being lost in that case.
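
For example, something like this (illustrative; the write and the WAIT have to be issued on the same connection, and the replica count and timeout are up to the application):

# WAIT 1 1000 blocks until at least 1 replica has acknowledged the preceding
# write or 1000 ms have passed, and returns the number of replicas that acked.
printf 'SET foo bar\nWAIT 1 1000\n' | valkey-cli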

It's not OK though to lose the whole data set. I want the cluster to prevent this from happening without manual intervention. Ping's ideas make sense to me.

If I understand this correctly, the primary needs to lose its data while keeping its nodes.conf, so it still has its place in the cluster. If so, I can't see why we can't store some more info in nodes.conf to prevent this from happening.

Another possibility: If the replicas can detect that the primary came back empty-handed, they could escalate/gossip this to the cluster in some way to mark the node as FAIL and trigger a failover.

hwware commented 3 months ago

I think in standalone mode we already solve this problem through the master-reboot-down-after-period mymaster parameter in sentinel.conf

The steps:

  1. Node 1 is the primary
  2. Node 2 is replicating node 1
  3. Node 1 drops
  4. Node 1 comes back without its DB, in an empty state
  5. Node 2 reconnects and performs a full resync
  6. Node 2 is now empty, having propagated the data loss

In real production, the gap between step 3 and step 4 cannot be too long, otherwise users would not accept it. In fact, the gap between steps 3 and 4 is the reboot time, so the master-reboot-down-after-period timing should solve it, because otherwise the replica could not be failed over.
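
For reference, it is configured per monitored primary in sentinel.conf, roughly like this (the 60000 ms value is just an example):

# sentinel.conf: once Sentinel detects that the monitored primary has rebooted,
# treat it as down for this many milliseconds so a failover can be triggered
# instead of replicas resyncing from the freshly restarted (possibly empty) node.
sentinel master-reboot-down-after-period mymaster 60000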

zuiderkwast commented 3 months ago

@hwware I don't understand how master-reboot-down-after-period can solve this in a Sentinel setup. If the primary has lost its data, it comes up fast. If it still has its data, it takes more time to load its data and be available. It's the first case we want to prevent here.

hwware commented 3 months ago

@hwware I don't understand how master-reboot-down-after-period can solve this in a Sentinel setup. If the primary has lost its data, it comes up fast. If it still has its data, it takes more time to load its data and be available. It's the first case we want to prevent here.

In step 4, when node 1 comes back without its DB, the master node has a "reboot" signal. If the master node has no "reboot" signal, it means one of 2 cases:

  1. The master node just lost its connection or cannot respond to other nodes because it is busy, but it should still hold its data
  2. The master node was shut down for a long time, and I am not sure clients could tolerate the master node not working for that long.

Maybe the master-reboot-down-after-period parameter cannot solve all the problems in the top comment, but at least it could solve a subset of them.

tuxArg commented 2 months ago

Hi, just an idea here. I think the solution could be a more general one. The server config could have a check-db-exists yes/no option and refuse to start if the RDB file is missing.

ag-TJNII commented 2 months ago

Hi, just an idea here. I think the solution could be a more general one. The server config could have a check-db-exists yes/no option and refuse to start if the RDB file is missing.

No, this is the opposite of the intent of the request. I do not want to have to manually set flags or toggle settings after bootstrapping. Furthermore, it wouldn't work at all in the case without persistent storage documented in the original ask, as the nodes always come up without an RDB file.

What I'm asking for is actually quite simple: if a follower was replicating a dataset and the primary comes back empty, don't continue replication. As replication tracks offsets, I believe this should be something as simple as:

if (( replication_disable_replicating_behind_primary == 1 ) && ( remote_offset < local_offset )) {
  log("Refusing to replicate from lagging primary");
  return;
}

This option will always break the replication and require admin intervention, and until then the setup will be running in non-HA mode. The core problem is that the primary didn't stay "down" long enough to trigger the election reliably.

Sentinel will trigger an election in the case I've described. The problem is, by the time Sentinel has triggered the election, the replicas have already replicated the empty dataset and we've lost the data. Relying on timing is hacky at best; restart timing shouldn't be a factor in whether the dataset gets dropped. As for the argument that availability is favored over consistency, having my DB dropped when the primary restarts defeats the purpose of HA; at that point I should just run a single node, as I'm going to lose all the data in either case. This is the wrong side of the availability / consistency balance, and there are plenty of cases where having a minute or two of downtime is better than losing all the data.

So for my real-world problem case I think the simple check and config flag I mentioned above will fix it. Making it a config flag should allow it to be turned on/off where needed to avoid any side effects. I'll see about making a proof-of-concept branch; pointers to the relevant code would be appreciated, as I've never worked in this codebase before.
