spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

rfr-redisfailover pods don't become ready #519

Closed: vbode closed this issue 1 year ago

vbode commented 1 year ago

Expected behaviour

Roll out Redis successfully.

Actual behaviour

I am able to roll out the redis-operator and a redis-failover config successfully, but the rfr-redisfailover StatefulSet pods never become ready.

Steps to reproduce the behaviour

I deploy the redis-operator using the Helm chart. Afterwards I apply the attached redis-failover manifest: redis-failover.txt
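
For reference, a RedisFailover of roughly this shape produces the rfr-redisfailover and rfs-redisfailover objects seen in the logs below; the values are illustrative, not the exact contents of the attachment:

kubectl apply -f - <<'EOF'
# Illustrative RedisFailover; the name and replica counts are placeholders,
# not the exact contents of the attached redis-failover.txt.
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redisfailover
  namespace: default
spec:
  sentinel:
    replicas: 3
  redis:
    replicas: 3
EOF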

Environment

How are the pieces configured?

Logs

The redis-operator gives the following error after applying the failover:

time="2022-11-15T08:49:11Z" level=info msg="service updated" namespace=default service=k8s.service serviceName=rfr-redisfailover src="service.go:99"
time="2022-11-15T08:49:11Z" level=error msg="error on object processing: Service \"rfs-redisfailover\" is invalid: [spec.clusterIPs[0]: Invalid value: []string(nil): primary clusterIP can not be unset, spec.ipFamilies[0]: Invalid value: []core.IPFamily(nil): primary ipFamily can not be unset]" controller-id=redisfailover object-key=default/redisfailover operator=redisfailover service=kooper.controller src="controller.go:279"
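
The Service fields that the API server rejected can be checked directly on the live object, for example:

# Show the clusterIP and ipFamilies of the sentinel service named in the error.
kubectl -n default get svc rfs-redisfailover -o jsonpath='{.spec.clusterIP}{" "}{.spec.ipFamilies}{"\n"}'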

The rfr-redisfailover pods give the following error:

1:S 15 Nov 2022 09:37:41.760 Connecting to MASTER 127.0.0.1:6379
1:S 15 Nov 2022 09:37:41.760 MASTER <-> REPLICA sync started
1:S 15 Nov 2022 09:37:41.760 Non blocking connect for SYNC fired the event.
1:S 15 Nov 2022 09:37:41.760 Master replied to PING, replication can continue...
1:S 15 Nov 2022 09:37:41.760 Partial resynchronization not possible (no cached master)
1:S 15 Nov 2022 09:37:41.760 Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master

Note

This is very similar to https://github.com/spotahome/redis-operator/issues/476, but the error in the redis-operator is different.

lunika commented 1 year ago

We are experiencing the same issue. I can add some logs if it helps.

In all the redis pod logs we see this:

1:C 23 Nov 2022 16:23:21.516 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 23 Nov 2022 16:23:21.516 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 23 Nov 2022 16:23:21.516 # Configuration loaded
1:S 23 Nov 2022 16:23:21.517 * monotonic clock: POSIX clock_gettime
1:S 23 Nov 2022 16:23:21.519 * Running mode=standalone, port=6379.
1:S 23 Nov 2022 16:23:21.519 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:S 23 Nov 2022 16:23:21.519 # Server initialized
1:S 23 Nov 2022 16:23:21.519 * Ready to accept connections
1:S 23 Nov 2022 16:23:21.519 * Connecting to MASTER 127.0.0.1:6379
1:S 23 Nov 2022 16:23:21.519 * MASTER <-> REPLICA sync started
1:S 23 Nov 2022 16:23:21.519 * Non blocking connect for SYNC fired the event.
1:S 23 Nov 2022 16:23:21.519 * Master replied to PING, replication can continue...
1:S 23 Nov 2022 16:23:21.519 * Partial resynchronization not possible (no cached master)
1:S 23 Nov 2022 16:23:21.519 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 23 Nov 2022 16:23:22.522 * Connecting to MASTER 127.0.0.1:6379
1:S 23 Nov 2022 16:23:22.522 * MASTER <-> REPLICA sync started
1:S 23 Nov 2022 16:23:22.522 * Non blocking connect for SYNC fired the event.
1:S 23 Nov 2022 16:23:22.522 * Master replied to PING, replication can continue...
1:S 23 Nov 2022 16:23:22.522 * Partial resynchronization not possible (no cached master)
1:S 23 Nov 2022 16:23:22.522 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master

In the sentinel pods we have these logs:

sentinel 1:X 23 Nov 2022 15:34:56.544 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
sentinel 1:X 23 Nov 2022 15:34:56.544 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
sentinel 1:X 23 Nov 2022 15:34:56.544 # Configuration loaded
sentinel 1:X 23 Nov 2022 15:34:56.544 * monotonic clock: POSIX clock_gettime
sentinel 1:X 23 Nov 2022 15:34:56.545 * Running mode=sentinel, port=26379.
sentinel 1:X 23 Nov 2022 15:34:56.545 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
sentinel 1:X 23 Nov 2022 15:34:56.548 # Sentinel ID is ef6989b563fc00f7850c7b5d64dfdae80f4520f0
sentinel 1:X 23 Nov 2022 15:34:56.548 # +monitor master mymaster 127.0.0.1 6379 quorum 2
sentinel 1:X 23 Nov 2022 15:34:57.608 # +sdown master mymaster 127.0.0.1 6379

And in the redis-operator pod we have these logs:

time="2022-11-23T16:30:10Z" level=error msg="Error while getting service rfr-redis-sentinel in preprod-d2rp namespace : services \"rfr-redis-sentinel\" not found" src="client.go:160"     │
│ time="2022-11-23T16:30:10Z" level=info msg="service updated" namespace=preprod-d2rp service=k8s.service serviceName=rfs-redis-sentinel src="service.go:99"                                 │
│ time="2022-11-23T16:30:10Z" level=info msg="configMap updated" configMap=rfs-redis-sentinel namespace=preprod-d2rp service=k8s.configMap src="configmap.go:84"                             │
│ time="2022-11-23T16:30:10Z" level=info msg="configMap updated" configMap=rfr-s-redis-sentinel namespace=preprod-d2rp service=k8s.configMap src="configmap.go:84"                           │
│ time="2022-11-23T16:30:10Z" level=info msg="configMap updated" configMap=rfr-readiness-redis-sentinel namespace=preprod-d2rp service=k8s.configMap src="configmap.go:84"                   │
│ time="2022-11-23T16:30:10Z" level=info msg="configMap updated" configMap=rfr-redis-sentinel namespace=preprod-d2rp service=k8s.configMap src="configmap.go:84"                             │
│ time="2022-11-23T16:30:10Z" level=info msg="podDisruptionBudget updated" namespace=preprod-d2rp podDisruptionBudget=rfr-redis-sentinel service=k8s.podDisruptionBudget src="poddisruptionb │
│ udget.go:85"                                                                                                                                                                               │
│ time="2022-11-23T16:30:10Z" level=info msg="statefulSet updated" namespace=preprod-d2rp service=k8s.statefulSet src="statefulset.go:108" statefulSet=rfr-redis-sentinel                    │
│ time="2022-11-23T16:30:10Z" level=info msg="podDisruptionBudget updated" namespace=preprod-d2rp podDisruptionBudget=rfs-redis-sentinel service=k8s.podDisruptionBudget src="poddisruptionb │
│ udget.go:85"                                                                                                                                                                               │
│ time="2022-11-23T16:30:10Z" level=info msg="deployment updated" deployment=rfs-redis-sentinel namespace=preprod-d2rp service=k8s.deployment src="deployment.go:109"                        │
│ time="2022-11-23T16:30:10Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.69.188" src="checker.go:119"                                              │
│ time="2022-11-23T16:30:30Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.68.251" src="checker.go:119"                                              │
│ time="2022-11-23T16:30:30Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.67.94" src="checker.go:119"                                               │
│ time="2022-11-23T16:30:31Z" level=error msg="Make new master failed, master ip: 10.233.68.38, error: dial tcp 10.233.68.38:6379: i/o timeout" src="checker.go:143"                         │
│ time="2022-11-23T16:30:50Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.69.128" src="checker.go:119"                                              │
│ time="2022-11-23T16:30:50Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.68.135" src="checker.go:119"                                              │
│ time="2022-11-23T16:30:51Z" level=error msg="Make slave failed, slave pod ip: 10.233.69.6, master ip: 10.233.68.38, error: dial tcp 10.233.69.6:6379: i/o timeout" src="checker.go:143"    │
│ time="2022-11-23T16:31:10Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.69.253" src="checker.go:119"                                              │
│ time="2022-11-23T16:31:10Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.69.98" src="checker.go:119" 
time="2022-11-23T16:31:11Z" level=error msg="Make slave failed, slave pod ip: 10.233.69.188, master ip: 10.233.68.38, error: dial tcp 10.233.69.188:6379: i/o timeout" src="checker.go:143 │
│ "                                                                                                                                                                                          │
│ time="2022-11-23T16:31:30Z" level=error msg="Make new master failed, master ip: 10.233.68.251, error: dial tcp 10.233.68.251:6379: i/o timeout" src="checker.go:143"                       │
│ time="2022-11-23T16:31:31Z" level=error msg="Make new master failed, master ip: 10.233.67.94, error: dial tcp 10.233.67.94:6379: i/o timeout" src="checker.go:143"                         │
│ time="2022-11-23T16:31:31Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.68.38" src="checker.go:158"                                               │
│ time="2022-11-23T16:31:50Z" level=error msg="Make slave failed, slave pod ip: 10.233.69.253, master ip: 10.233.68.251, error: dial tcp 10.233.69.253:6379: i/o timeout" src="checker.go:14 │
│ 3"                                                                                                                                                                                         │
│ time="2022-11-23T16:31:51Z" level=error msg="Make slave failed, slave pod ip: 10.233.68.135, master ip: 10.233.67.94, error: dial tcp 10.233.68.135:6379: i/o timeout" src="checker.go:143 │
│ "                                                                                                                                                                                          │
│ time="2022-11-23T16:31:51Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.233.69.6" src="checker.go:158"
vbode commented 1 year ago

This issue does not occur in kubernetes 1.21.9.

jackskj commented 1 year ago

Seeing a similar issue on 1.23.5; however, in our case it happens when password auth is used.

jackskj commented 1 year ago

I noticed that the readiness probe fails at https://github.com/spotahome/redis-operator/blob/716d0d60dd561ea1041e5ab9a71cf8e0b8952159/operator/redisfailover/service/generator.go#L242

with AUTH failed: WRONGPASS invalid username-password pair or user is disabled. This is probably because kubelet is trying to log in as a non-default user. Running redis-cli in the following format fixed the readiness issue: redis-cli -u redis://default:password@localhost:port info replication
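
For reference, a minimal sketch of a readiness check built around that redis-cli form; this is not the operator's generated script from generator.go, and the environment variable names are placeholders ("default" is the built-in Redis 6 ACL user):

#!/bin/sh
# Hypothetical readiness-check sketch; REDIS_PASSWORD and REDIS_PORT are
# placeholder variable names, not ones defined by the operator.
REDIS_PORT="${REDIS_PORT:-6379}"

# Authenticate explicitly as the "default" ACL user so AUTH does not fail
# with WRONGPASS when kubelet runs the probe.
info=$(redis-cli -u "redis://default:${REDIS_PASSWORD}@localhost:${REDIS_PORT}" info replication)

# Ready if this node is the master, or a replica whose master link is up.
echo "$info" | grep -q '^role:master' && exit 0
echo "$info" | grep -q '^master_link_status:up' && exit 0
exit 1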

jackskj commented 1 year ago

I noticed that in my case this issue was caused by accidentally having a space at the end of the password in the Kubernetes secret. Redis automatically removed the space, but the readiness script did not.
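
If the same pitfall is suspected, the stored value can be checked for trailing whitespace, for example (the secret and key names here are placeholders for your own):

# Decode the password from the secret and dump the raw bytes; a trailing
# 0x20 (space) or 0x0a (newline) at the end is the tell-tale sign.
kubectl get secret redis-auth -o jsonpath='{.data.password}' | base64 -d | xxd | tail -n 1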

zgfh commented 1 year ago

As the operator log shows:

time="2022-11-23T16:31:51Z" level=error msg="Make slave failed, slave pod ip: 10.233.68.135, master ip: 10.233.67.94, error: dial tcp 10.233.68.135:6379: i/o timeout" src="checker.go:143"

This means the operator wants to fix the Redis cluster but cannot reach the pods on :6379.

In my case the pods were unreachable because the CNI network was broken; once I fixed the CNI network, Redis recovered.
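
A quick way to rule the network layer in or out is to test the port the operator needs from another pod, for example (the namespace, pod name and IP here are taken from the logs earlier in this thread and are only placeholders):

# PING one redis pod on 6379 from inside another pod; an i/o timeout here
# points at the CNI / NetworkPolicy layer rather than at the operator.
kubectl -n preprod-d2rp exec rfr-redis-sentinel-0 -- redis-cli -h 10.233.68.135 -p 6379 ping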

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

Wouter0100 commented 1 year ago

I have the same issue. We deploy three separate Redis failovers, and all of them are affected. Interestingly enough, it is always the second pod of each StatefulSet:

rfr-engine-be-0                                 2/2     Running     0                42d
rfr-engine-be-1                                 1/2     Running     0                5m3s
rfr-engine-be-2                                 2/2     Running     0                42d
rfr-engine-gb-0                                 2/2     Running     0                42d
rfr-engine-gb-1                                 1/2     Running     0                42d
rfr-engine-gb-2                                 2/2     Running     0                42d
rfr-engine-nl-0                                 2/2     Running     0                42d
rfr-engine-nl-1                                 1/2     Running     0                18d
rfr-engine-nl-2                                 2/2     Running     0                42d

Logs are all identical to:

1:S 01 Mar 2023 16:37:34.074 * Connecting to MASTER 127.0.0.1:6379
1:S 01 Mar 2023 16:37:34.074 * MASTER <-> REPLICA sync started
1:S 01 Mar 2023 16:37:34.074 * Non blocking connect for SYNC fired the event.
1:S 01 Mar 2023 16:37:34.075 * Master replied to PING, replication can continue...
1:S 01 Mar 2023 16:37:34.075 * Partial resynchronization not possible (no cached master)
1:S 01 Mar 2023 16:37:34.075 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master

Operator logs:

time="2023-03-01T16:39:46Z" level=error msg="error on object processing: more than one master, fix manually" controller-id=redisfailover object-key=production/engine-be operator=redisfailover service=kooper.controller src="controller.go:279"
time="2023-03-01T16:39:46Z" level=error msg="error on object processing: more than one master, fix manually" controller-id=redisfailover object-key=production/engine-gb operator=redisfailover service=kooper.controller src="controller.go:279"
time="2023-03-01T16:39:47Z" level=error msg="error on object processing: more than one master, fix manually" controller-id=redisfailover object-key=production/engine-nl operator=redisfailover service=kooper.controller src="controller.go:279"
zekena2 commented 1 year ago

I have the same issue with a k3d/k3s cluster v1.24.10, but oddly enough the operator shows no errors whatsoever.

jouve commented 1 year ago

I had the issue because of network policies; after allowing the operator to connect to the redis pods (rfr-*), the problem was fixed.
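
For reference, a sketch of such a policy; the label selectors are placeholders and must be matched against the labels actually present on the operator and the rfr-*/rfs-* pods (check with kubectl get pods --show-labels):

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-operator-to-redis
spec:
  # Placeholder selector for the redis/sentinel pods managed by the operator.
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: redis-failover
  ingress:
    - from:
        - podSelector:
            # Placeholder selector for the redis-operator pod.
            matchLabels:
              app.kubernetes.io/name: redis-operator
      ports:
        - protocol: TCP
          port: 6379
        - protocol: TCP
          port: 26379
EOF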

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.