valkey-io / valkey

A flexible distributed key-value datastore that is optimized for caching and other real-time workloads.
https://valkey.io

Sentinel split-brain after failover #1322

Open MuhammadQadora opened 6 days ago

MuhammadQadora commented 6 days ago

Describe the bug

Deployment architecture: we are running a two-replica deployment (one master, one replica) on Kubernetes in the cloud, as a StatefulSet. Each pod runs on a dedicated instance (VM). We have three pods:

- node-0 (master): runs a Redis server and a Sentinel process
- node-1 (replica): runs a Redis server and a Sentinel process
- sentinel-0: runs only a Sentinel process

In total there are 3 Sentinels with a quorum of 2.
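For context, here is a minimal sketch of what a Sentinel configuration for this topology (three Sentinels, quorum of 2) typically looks like. The master name `mymaster`, the hostnames, ports, and timeouts are placeholders, not our actual settings; the real configuration is in the attached node_and_sentinel.docx.

```bash
# Hypothetical sentinel.conf for the topology described above (quorum 2 of 3).
# All names, addresses, paths, and timeouts are placeholders; see
# node_and_sentinel.docx for the real configuration.
cat > /etc/valkey/sentinel.conf <<'EOF'
port 26379
sentinel monitor mymaster node-0.valkey-headless.default.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
# Needed when the master is monitored by DNS name rather than by IP.
sentinel resolve-hostnames yes
EOF
```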

In our pipeline we run a test that does the following (a command-level sketch follows the list):

1. Restart node-0, the master (`kubectl delete --grace-period=60`), and check that the data persists and that node-1 became master.
2. Restart sentinel-0 and check that the data persists.
3. Restart node-1 (the master at this point) and check that the data persists and that node-0 became master. At this point node-0 is master and node-1 is replica.
4. Stop all nodes (VMs); each pod runs on its own node.
5. Start all nodes and check that the data persists.
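A rough sketch of steps 1 to 5 as kubectl commands, assuming the pod names above and the default namespace; the actual pipeline may differ, and steps 4/5 are cloud-provider specific.

```bash
# Sketch of the test sequence above. Pod names (node-0, node-1, sentinel-0)
# and the namespace are assumptions based on the description.

# 1. Restart node-0 (current master); Sentinel should promote node-1.
kubectl delete pod node-0 --grace-period=60
kubectl wait --for=condition=Ready pod/node-0 --timeout=300s

# 2. Restart the standalone Sentinel pod.
kubectl delete pod sentinel-0 --grace-period=60
kubectl wait --for=condition=Ready pod/sentinel-0 --timeout=300s

# 3. Restart node-1 (master at this point); node-0 should become master again.
kubectl delete pod node-1 --grace-period=60
kubectl wait --for=condition=Ready pod/node-1 --timeout=300s

# 4./5. Stop and later restart all underlying VMs (cloud-provider specific,
#       e.g. through the provider CLI), then re-check data and roles.
```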

During steps 1 to 4 we run a while loop that continuously reads and writes, to verify zero downtime (sketched below). Steps 1 to 4 go as expected, but after starting all instances again, which come up in this order: sentinel-0 -> node-0 and node-1 (roughly at the same time; sometimes node-1 starts before node-0 and vice versa), we hit a split-brain state: sentinel-0 says node-1 is the master, while node-0 and node-1 both say node-0 is the master. We waited to see whether the views would eventually converge, but they do not.
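For reference, the read/write loop is roughly the following. The Sentinel hostname, master name (`mymaster`), and ports are assumptions and should be read against the attached configuration; `redis-cli` works the same way if `valkey-cli` is not available in the image.

```bash
# Read/write loop used during steps 1-4 to check for zero downtime.
# Hostnames, the master name, and ports are placeholders.
while true; do
  # Ask a Sentinel for the current master address (IP on the first line, port on the second).
  ADDR=$(valkey-cli -h sentinel-0 -p 26379 SENTINEL get-master-addr-by-name mymaster)
  MASTER_IP=$(echo "$ADDR" | head -n 1)
  MASTER_PORT=$(echo "$ADDR" | tail -n 1)
  # Write and read back a key through the master that Sentinel reports.
  valkey-cli -h "$MASTER_IP" -p "$MASTER_PORT" SET canary "$(date +%s)"
  valkey-cli -h "$MASTER_IP" -p "$MASTER_PORT" GET canary
  sleep 1
done
```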


To reproduce

Follow steps 1 to 5 above.

Expected behavior

If the Sentinels and the nodes disagree about which instance is the master, they should eventually converge on a single master (eventual consistency).
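Concretely, this is roughly how we watch for convergence: Sentinel's reported master address should end up pointing at the one node that reports itself as master. Hostnames and the master name are again placeholders.

```bash
# Print the Sentinel view and each node's own replication role every few
# seconds; once the cluster has converged, Sentinel's master address points
# at the single node whose ROLE output starts with "master".
# Hostnames and the master name are placeholders.
while true; do
  SENTINEL_MASTER=$(valkey-cli -h sentinel-0 -p 26379 \
      SENTINEL get-master-addr-by-name mymaster | head -n 1)
  NODE0_ROLE=$(valkey-cli -h node-0 -p 6379 ROLE | head -n 1)
  NODE1_ROLE=$(valkey-cli -h node-1 -p 6379 ROLE | head -n 1)
  echo "sentinel-0 -> master at $SENTINEL_MASTER | node-0: $NODE0_ROLE | node-1: $NODE1_ROLE"
  sleep 5
done
```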

Additional information

The config files node.conf and sentinel.conf for the instances: node_and_sentinel.docx

The logs for the sentinels before stopping the VMs: before-node-0.log before-node-1.log before-sentinel-0.log

The logs for the sentinels after starting the VMs: after-node-0.log after-node-1.log after-sentinel-0.log
