helm / charts

⚠️(OBSOLETE) Curated applications for Kubernetes

[redis-ha] Redis becomes unavailable after a series of pod failures #10635

Closed panos1208 closed 5 years ago

panos1208 commented 5 years ago

Is this a request for help?: No

Is this a BUG REPORT or FEATURE REQUEST? (choose one): I think both

Version of Helm and Kubernetes: Helm version: 2.10 Kubernetes version: 1.10

Which chart: redis-ha

What happened: A simple series of pod failures can drive this Redis configuration into a deadlock during which Redis is not available.

What you expected to happen: Considering that this is a chart for redis high-availability, I would expect it to resist simple pod failures.

The key issue comes from Redis itself, specifically from the master election process performed by the sentinels. Each sentinel remembers every sentinel that has ever existed (until a reset command is issued), even when sentinel pods have been deleted and are no longer accessible. The election process requires a majority of all sentinels seen so far. If there have been enough pod failures, the list of known sentinels grows large and eventually the number of live sentinels falls below the required majority. To recover from this, one must issue a reset command to all sentinels.

I am wondering if it is possible for this chart to provide an automated way to issue this reset command.
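(For reference, a manual recovery along these lines might look like the sketch below; the pod and container names are assumptions, not taken from the chart.)

for i in 0 1 2; do
    kubectl exec redis-ha-server-$i -c sentinel -- redis-cli -p 26379 sentinel reset mymaster
done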

How to reproduce it (as minimally and precisely as possible):

Basic setup:

Assume the basic redis-ha setup of 3 pods P1, P2, P3, each hosting two containers: [M1, S1], [R1, S2], [R2, S3], where M1 is the master and S1 its sentinel, R1 is a replica and S2 its sentinel, etc., all under the master group mymaster. When queried, all sentinels report the same number of available sentinels, which is 3.

Failure scenario:

When a failure occurs in P2 (the pod hosting R1 and S2) and the pod restarts, sentinels S1 and S3 will now report 4 available sentinels while S2 will report 3. If two more failures occur on that same pod (for the sake of simplicity), S1 and S3 will report 6 available sentinels. At that point both S1 and S3 will fail to elect a leader, since they cannot gather a majority (3 live sentinels where 4 are required), and, as a result, after some time S2 will also fail for the same reason.
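(One way to observe the inflated count, as a sketch assuming the default sentinel port and master group name, is to ask each sentinel for its view of the master:)

redis-cli -p 26379 sentinel master mymaster | grep -A1 num-other-sentinels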

Anything else we need to know:

This is related to the following issues:

Thank you in advance.

ssalaues commented 5 years ago

@panos1208 I made some changes recently in PR https://github.com/helm/charts/pull/10032 that specifically address this issue. Basically it fixes the problem you are describing by persisting the Sentinel ID even after a restart/delete/etc., so the number of known sentinels doesn't artificially grow.

I would suggest trying the latest chart version (3.1.1 at the time of this posting).
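(As a sketch, assuming an existing release named redis-ha, the upgrade would be:)

helm upgrade redis-ha stable/redis-ha --version 3.1.1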

We have a good community working on improving this chart, and I appreciate you opening this issue, as this was definitely a big problem that was causing instability. There is another good PR to keep an eye on, https://github.com/helm/charts/pull/10178, which is in review and will also add more stability/reliability to this chart.

panos1208 commented 5 years ago

Thank you very much for your response. The new chart seems to be working fine.

towolf commented 5 years ago

it fixes the issue you are describing by persisting the Sentinel ID

@ssalaues if I have no persistent storage, and /data is only an emptyDir, then this mechanism does not work, right?

So, your fix only works for cases where a PVC is mounted?

ssalaues commented 5 years ago

@towolf Yeah you are correct. The emptyDir wouldn't persist after pod deletion so the sentinel ID wouldn't persist unfortunately.

If there's a need for it, it could be possible to set up a redis-cli sentinel reset at init time when persistent storage is disabled, to clear out the old sentinels: https://redis.io/topics/sentinel#adding-or-removing-sentinels
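(Very roughly, such an init step might look like the sketch below; the $SERVICE-announce-N naming follows the chart's existing init script, while the replica count and port are assumptions.)

for i in 0 1 2; do
    PEER="$(getent hosts "$SERVICE-announce-$i" | awk '{ print $1 }')"
    if [ -n "$PEER" ]; then
        redis-cli -h "$PEER" -p 26379 sentinel reset mymaster || true
    fi
done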

towolf commented 5 years ago

@ssalaues so if emptyDir is the default mode, we should definitely add a mechanism that makes failover work as best as we can, no?

We do not attach PVC to redis-ha and I wonder what the effect will be.

These entries with port 0 are all phantoms now, yes? And they are taken into account for quorum?

# Generated by CONFIG REWRITE
port 26379
sentinel leader-epoch mymaster 1
sentinel known-replica mymaster 10.109.137.103 6379
sentinel known-replica mymaster 10.109.175.4 6379
sentinel known-sentinel mymaster 10.109.137.103 0 3a8e05cac935a11f8daa20fa73ce0569074ce7ea
sentinel known-sentinel mymaster 10.109.175.4 26379 6568ae30a8e7651a4cf8c82e234f19c6f55aa5e9
sentinel known-sentinel mymaster 10.109.137.103 26379 0c909685c322999f5eefeb874763600e139696b5
sentinel known-sentinel mymaster 10.109.137.103 0 070df4d8a5b31cf1925da02e74559e59dd2d3ddb
sentinel known-sentinel mymaster 10.109.137.103 0 bfc3f9f304e2f7f17cdae408dd2a272ca1a36706
sentinel known-sentinel mymaster 10.109.137.103 0 ed800cfc49956fa07178f14304b206d68d5b9f33
sentinel current-epoch 1
sentinel announce-ip "10.109.187.108"
sentinel announce-port 26379
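(For reference, each sentinel's current view of its peers, including the port-0 phantoms above, can be listed directly; a sketch assuming the default sentinel port:)

redis-cli -p 26379 sentinel sentinels mymaster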
towolf commented 5 years ago

@ssalaues can't we hard-code the 3 sentinel IDs somehow in the Helm chart, or generate them randomly during chart templating?

Then during startup redis-ha-server-0 writes the 1st ID to its sentinel.conf, redis-ha-server-1 takes the 2nd ID, etc.

That way the sentinel IDs would be stable?

towolf commented 5 years ago

@ssalaues How about something like this?

{{/*
Pre-generate a Sentinel ID for each replica
Set of labels combined with replica index yields 40 hex char ID
*/}}
{{- define "redis-ha.stable-sentinel-ids" -}}
{{- $replicas := int .Values.replicas -}}
{{- range $i := until $replicas -}}
SENTINEL_ID_{{ $i }}: {{ printf "%s\nindex: %d" (include "labels.standard" $) $i | sha1sum }}
{{ end -}}
{{- end -}}

This results in

data:
  SENTINEL_ID_0: 40c017e046f7dcd6c971642d3723e754214f98d9
  SENTINEL_ID_1: 876c7f457e0313bc63a428fca8a0d4aa1ea1749c
  SENTINEL_ID_2: 2ba00feb2b857d94dbd61e95fc207719e6ec3ad6

  redis.conf: |
    dir "/data"
    maxmemory 0
    maxmemory-policy volatile-lru
    min-slaves-max-lag 5

And we can use this to write myid into the sentinel.conf?
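(A rough sketch of how the init script could consume those values; the INDEX/SENTINEL_CONF variables and the env-var wiring from the ConfigMap are assumptions:)

ID_VAR="SENTINEL_ID_$INDEX"              # INDEX = pod ordinal
SENTINEL_ID="$(eval echo "\$$ID_VAR")"   # value injected from the ConfigMap
if [ -n "$SENTINEL_ID" ]; then
    echo "sentinel myid $SENTINEL_ID" >> "$SENTINEL_CONF"
fi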

ssalaues commented 5 years ago

@ssalaues so if emptyDir is the default mode, we should definitely add a mechanism that makes failover work as best as we can, no?

We do not attach PVC to redis-ha and I wonder what the effect will be.

These entries with port 0 are all phantoms now, yes? And they are taken into account for quorum?

@towolf emptyDir isn't the default mode currently, which is why I hadn't prioritized it. However, your idea to generate the IDs at templating time might be the best solution for everyone. Definitely worth trying out to see whether sentinel will accept a pre-generated ID like that.

Also, yes, all the "port 0" entries are the ghosts, unfortunately.

That way the sentinel IDs would be stable?

The way you have it should keep the IDs stable.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

devsinghnaini commented 5 years ago

@panos1208 I made some changes recently in PR #10032 that specifically address this issue. Basically it fixes the issue you are describing by persisting the Sentinel ID even after restart/delete/etc this way the number of sentinels doesn't artificially grow.

I would suggest trying the latest chart version (3.1.1 at the time this posting).

We have a good community working on improving this chart and I appreciate you opening this issue as this was definitely a big problem that was causing instability. There is another good PR to keep an eye on #10178 which is in review and will also add more stability/reliability to this chart.

I am using the new chart, 3.2.0, and I am facing the same issue after deleting the pods. When I try to set a value in Redis after shutting down the pods, it returns (error) NOREPLICAS Not enough good replicas to write. Can anyone help with this?
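(NOREPLICAS is returned when the number of connected replicas falls below the min-slaves-to-write threshold; a quick check, with the port and this Redis generation's parameter spelling assumed:)

redis-cli -p 6379 info replication | grep connected_slaves
redis-cli -p 6379 config get min-slaves-to-write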

djchapm commented 5 years ago

Ditto... @towolf - still not working. I have the sentinel IDs and I am not doing anything fancy beyond the out-of-the-box Helm chart downloaded yesterday. The cluster starts fine, but today I'm unable to identify the proper master via the sentinels. I get an "unable to connect" error, so it's like the IPs went stale and they are not able to see the updates.

This IP thing with Redis is REALLY not working well for a Kubernetes deployment. I wish it would work off k8s DNS services.

Any guidance on getting this cluster to be stable in a non-persistent k8s env would be great... I lack the expertise you all have, so trying to figure it out with the guidance provided in the readme is a bit rough.

towolf commented 5 years ago

@djchapm I can only say that it's been working much better for us after the recent changes.

We haven't gotten any more NOREPLICAS errors, and we've done 2 full node maintenances, where every pod has been evicted multiple times.

ssalaues commented 5 years ago

@devsinghnaini Do you have any custom values that you can post here so I can try reproducing your issue?

@djchapm The newer versions of the chart (as of #10032) use stable IPs from individual per-pod Services and set them as the announce IP for each pod, so they shouldn't have stale IPs. What version exactly are you using (check via helm ls)?
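(A quick sanity check, with the release and Service names assumed: each pod should have its own announce Service with a stable ClusterIP.)

helm ls                                    # shows the installed chart version
kubectl get svc | grep redis-ha-announce   # one Service, and stable IP, per pod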

devsinghnaini commented 5 years ago

@ssalaues Please find the values.yaml attached: values.txt

Chart version: 3.2.0. Also, I have removed the SHA1SUM from the StatefulSet as it was giving an error.

MMartyn commented 5 years ago

I would like to add that I am also seeing an issue where it sets the IP in redis.conf/sentinel.conf to the pod IP instead of the announce Service IP. For example, my pod-0 had the following in sentinel.conf:

sentinel monitor redis-cluster {pod_ip} 6379 2

and redis.conf of:

slaveof {pod_ip} 6379

The logs for the init container are as follows:

Initializing config..
Attempting to find master
Can't ping master, attempting to force failover
Could not connect to Redis at {pod_ip}:6379: Operation timed out
Setting up defaults
Setting this pod as the default master
Updating sentinel config
Ready..

Also note that the pod IP it was using was, I believe, its old pod IP.

ssalaues commented 5 years ago

@devsinghnaini I was able to install successfully with your values and version (minus some tabbing issues and references to "redis.name" instead of "redis-ha.name"). But I did not need to change the SHA checks, and really you shouldn't need to either. It's possible that this change is linked to your issue, as this is the mechanism used to trigger a rolling update of the pods whenever there is a ConfigMap change.

@MMartyn I also just tested a default install of the latest version (helm install --name redis-ha stable/redis-ha --version 3.3.0) and could confirm that the announce Service IPs are being used even after a series of pod deletions. What you're describing is the expected behavior of older versions. I would suggest checking your installed version and updating to the latest.

MMartyn commented 5 years ago

@ssalaues Really sorry about that. I could have sworn I was on the latest, but you are correct, I was on 3.1.5.

ssalaues commented 5 years ago

No worries! Glad it was a simple fix @MMartyn 😄

MMartyn commented 5 years ago

@ssalaues So I believe it is still an issue on the latest version. It appears to be because, when one Redis pod is set as the master, it doesn't get its announce IPs set (i.e. redis_update() isn't run on the master), and then, through a series of failovers, that pod's IP ends up in the slaveof line on the other pods, since it isn't advertising the Service IP for itself.

Way to reproduce: shell into the original master after the cluster comes up and force a couple of failovers until you observe the pod IP in redis.conf. I believe that if the announce IPs are set on all of the pods then the issue would be solved.
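(A sketch of that reproduction; the pod/container names and config path are assumptions:)

kubectl exec -it redis-ha-server-0 -c sentinel -- redis-cli -p 26379 sentinel failover mymaster
kubectl exec redis-ha-server-1 -- grep slaveof /data/conf/redis.conf   # look for a raw pod IP here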

MMartyn commented 5 years ago

Switching function to:

setup_defaults() {
        echo "Setting up defaults"
        if [ "$INDEX" = "0" ]; then
            echo "Setting this pod as the default master"
            redis_update "$ANNOUNCE_IP"
            sentinel_update "$ANNOUNCE_IP"
            sed -i "s/^.*slaveof.*//" "$REDIS_CONF"
        else
            DEFAULT_MASTER="$(getent hosts "$SERVICE-announce-0" | awk '{ print $1 }')"
            if [ -z "$DEFAULT_MASTER" ]; then
                echo "Unable to resolve host"
                exit 1
            fi
            echo "Setting default slave config.."
            redis_update "$DEFAULT_MASTER"
            sentinel_update "$DEFAULT_MASTER"
        fi
    }

fixes it for me. I'll open a PR for it.

ssalaues commented 5 years ago

@MMartyn I still haven't been able to reproduce this issue with various failovers, even when failing 2/3 pods at once (including the master). Regardless, I can see what you're talking about regarding redis_update().

-- edit: just saw all the changes in your PR; looks good and will review

djchapm commented 5 years ago

Hey @ssalaues, thanks for the response - I'm using recent charts... I believe the issue is related to persistence=false. Also, I'm using helm template to install, as opposed to Helm directly, so I don't have to install Helm. Our ops team really locks down our access - no volume claims, and we're unable to install Helm fully due to permissions.
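(Roughly what that helm template workflow looks like; the flags are Helm 2 era and the persistence values key is an assumption, not confirmed against the chart:)

helm fetch stable/redis-ha --untar
helm template ./redis-ha --name redis-ha --set persistentVolume.enabled=false > redis-ha.yaml
kubectl apply -f redis-ha.yaml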