spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

Issue with #533 #535

Closed samof76 closed 1 year ago

samof76 commented 1 year ago

@ese there seems to be an inherent issue with #533

Before applying check and heal, wait for all expected pods to be up and running instead of only waiting for them to exist, letting the Kubernetes controllers do their job.
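
To make the difference concrete, here is a minimal sketch of the two gates, written against client-go's pod types. This is not the operator's actual code and the function names are made up; it only illustrates the before/after behavior:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// allPodsRunningAndReady is the stricter gate: check-and-heal only proceeds
// once every expected pod is Running and reports the Ready condition.
func allPodsRunningAndReady(pods []corev1.Pod, expected int) bool {
	ready := 0
	for _, p := range pods {
		if p.Status.Phase == corev1.PodRunning && isReady(p) {
			ready++
		}
	}
	return ready >= expected
}

// allPodsExist is the looser gate: it only requires the pods to exist,
// leaving startup and rescheduling to the Kubernetes controllers.
func allPodsExist(pods []corev1.Pod, expected int) bool {
	return len(pods) >= expected
}

// isReady reports whether the pod's Ready condition is True.
func isReady(p corev1.Pod) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```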

Consider this scenario....

  1. The master and sentinel pods are running.
  2. The master pod and sentinels get killed.
  3. Those pods are unable to get scheduled.

In this case the check-and-heal would not do what it's intended to do.

Consider another scenario...

  1. The master and sentinel pods are running.
  2. All of the sentinels get killed, along with a slave.
  3. Now the sentinels get scheduled, but one slave is still not scheduled.

In this case the check-and-heal would not configure the sentinels because of the fix here.

samof76 commented 1 year ago

Looks like #536 might fix this.

ese commented 1 year ago

First scenario: IMHO if the Redis master and more than N/2 sentinels are deleted, the cluster is effectively broken until they can be scheduled and running again. Having the operator perform actions during that period is quite dangerous because, in the end, we rely on the sentinels to maintain quorum in the long term.

Second scenario: it makes sense not to wait for all redis replicas to be available before reconciling the sentinels with the existing master.
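
Roughly the shape I have in mind, as a sketch only; every name below is hypothetical and it is not a patch against the operator:

```go
package sketch

// Sentinel stands in for one reachable sentinel pod; ConfigureFunc stands in
// for whatever call points a sentinel at a master (SENTINEL MONITOR /
// SENTINEL SET under the hood). Both are hypothetical names.
type Sentinel struct{ Addr string }

type ConfigureFunc func(s Sentinel, masterAddr string) error

// reconcileSentinels points every reachable sentinel at the already-known
// master, without gating on every redis replica being ready.
func reconcileSentinels(sentinels []Sentinel, masterAddr string, configure ConfigureFunc) error {
	for _, s := range sentinels {
		if err := configure(s, masterAddr); err != nil {
			return err
		}
	}
	return nil
}
```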

I don't think #536 resolves this, thanks @samof76

AndreasSko commented 1 year ago

We might have faced a scenario similar to the one described here which resulted in an outage of our Redis cluster. We run 3 Redis and Sentinel pods each. In our case, some of the pods were rescheduled on new nodes, but one Sentinel pod failed to properly terminate (e.g. the logs indicate that it shut down successfully, but it stayed in the deployment; we are still trying to figure out what exactly happened here). After this, the operator failed to configure the rest of the Redis and Sentinel pods and just logged:

Number of sentinel mismatch, waiting for sentinel deployment reconcile

Probably our setup would have continued to work if the redis-operator had been able to configure the remaining pods.
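
To make the failure mode concrete, here is my guess at the shape of the gate behind that log line (this is an assumption, not the operator's actual implementation):

```go
package sketch

import (
	"errors"

	corev1 "k8s.io/api/core/v1"
)

// checkSentinelNumber is a guess at the gate behind the log message above:
// if the raw pod count disagrees with the expected replica count, the
// operator skips configuring anything and just waits.
func checkSentinelNumber(pods []corev1.Pod, expectedReplicas int) error {
	if len(pods) != expectedReplicas {
		return errors.New("number of sentinel mismatch, waiting for sentinel deployment reconcile")
	}
	return nil
}
```

If the gate works anything like this, the leftover pod keeps the count wrong and nothing downstream is ever configured again.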

I will try to reproduce our scenario but wanted to already note my initial findings here 🙂

AndreasSko commented 1 year ago

One update: I was able to reliably reproduce this issue by triggering an eviction of a sentinel pod (for example by allocating a huge file via `fallocate -l 100G big.file` and forcing an eviction by the kubelet). In this case there will be one Completed Sentinel pod and the redis-operator will wait "for sentinel deployment reconcile". If in the meantime the other sentinel pods are restarted, the whole cluster will fall apart as nothing is getting configured anymore.
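
One idea, sketched only from this symptom and not verified against the operator's code (the helper name is made up): skip pods that have already terminated when counting sentinels, so a leftover evicted pod cannot block reconciliation indefinitely.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// countUsableSentinels ignores pods whose phase is Succeeded or Failed
// (e.g. the evicted pod above that shows up as Completed), so a leftover
// terminated pod would not keep the count wrong indefinitely.
func countUsableSentinels(pods []corev1.Pod) int {
	usable := 0
	for _, p := range pods {
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			continue
		}
		usable++
	}
	return usable
}
```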

samof76 commented 1 year ago

@AndreasSko were you able to reproduce this with the latest operator version?

AndreasSko commented 1 year ago

Yes, we are running v1.2.4

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

AndreasSko commented 1 year ago

Unfortunately, the issue is still happening with our system and has resulted in a couple of outages 😅 Would it be possible to re-open it?