Closed by raghu-nandan-bs 1 year ago
Test cases and mocks need updating; I will do that once we confirm that this change fixes the issue.
@zekena2 @tparsa can you please try building an image from this branch and test whether it fixes the issue you are facing? I tried reproducing the issue, and this change seems to fix it.
kind create cluster --name 550
kubectl create -f manifests/databases.spotahome.com_redisfailovers.yaml
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redisfailover1
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
---
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redisfailover2
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
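Apply both RedisFailover manifests, for example (the filename redisfailovers.yaml below is just a placeholder for wherever the two manifests above were saved):
# redisfailovers.yaml is a placeholder filename for the two manifests above
kubectl apply -f redisfailovers.yaml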
# from the redis-operator repo's folder
make image
kind load docker-image redis-operator --name 550
apply the following config:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: redisoperator
  name: redisoperator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redisoperator
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: redisoperator
    spec:
      serviceAccountName: redisoperator
      containers:
        - image: redis-operator:latest
          args:
            - --log-level=debug
          imagePullPolicy: Never
          name: app
          securityContext:
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          resources:
            limits:
              cpu: 100m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
      restartPolicy: Always
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: redisoperator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: redisoperator
subjects:
  - kind: ServiceAccount
    name: redisoperator
    namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: redisoperator
rules:
  - apiGroups:
      - databases.spotahome.com
    resources:
      - redisfailovers
      - redisfailovers/finalizers
    verbs:
      - "*"
  - apiGroups:
      - apiextensions.k8s.io
    resources:
      - customresourcedefinitions
    verbs:
      - "*"
  - apiGroups:
      - ""
    resources:
      - pods
      - services
      - endpoints
      - events
      - configmaps
      - persistentvolumeclaims
      - persistentvolumeclaims/finalizers
    verbs:
      - "*"
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - "get"
  - apiGroups:
      - apps
    resources:
      - deployments
      - statefulsets
    verbs:
      - "*"
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - "*"
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - "*"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: redisoperator
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: http
    prometheus.io/scrape: "true"
  name: redisoperator
  labels:
    app: redisoperator
spec:
  type: ClusterIP
  ports:
    - name: metrics
      port: 9710
      protocol: TCP
      targetPort: metrics
  selector:
    app: redisoperator
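Then apply it and wait for the operator pod to come up, for example (operator.yaml is a placeholder for wherever the manifests above were saved):
# operator.yaml is a placeholder filename for the operator manifests above
kubectl apply -f operator.yaml
kubectl rollout status deployment/redisoperator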
# select a master
MASTER=$(kubectl get pods -l redisfailovers-role=master -ojson | jq ' .items[1] | .status.podIP ' | tr -d '"')
# make all redis pods slave of this master
kubectl get pods -l app.kubernetes.io/component=redis | awk '{print $1}' | grep -v NAME | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it pod/{} -c redis -- redis-cli -p 6379 slaveof ${MASTER} 6379"
# following this, a new failover will be triggered by the sentinels and a new master will be chosen for all instances in both redis failovers
# repeat the above commands if the issue isn't reproduced
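To check whether the split actually happened, something like the following (a sketch reusing the labels from the steps above) prints which master each redis pod is currently replicating:
# sketch: show the role and replication target of every redis pod
kubectl get pods -l app.kubernetes.io/component=redis -o name | while read -r POD; do
  echo "${POD}:"
  kubectl exec "${POD}" -c redis -- redis-cli -p 6379 info replication | grep -E 'role|master_host'
done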
I don't think this is going to work. We were using version 1.2.2, which could set the master manually in this line and call restoreSentinels in this line, since the number of slaves is not the expected value (it is at least one more than expected).
I think the piece of code where RestoreSentinel is called was not being reached in this flow... the checkandheal loop would simply exit after setting the oldest as master. What follows, I think, is that the sentinels would promptly make these redis instances follow the same old master again, and it forms a cycle. Resetting the sentinels simultaneously will hopefully break this cycle.
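For reference, "resetting the sentinels simultaneously" corresponds roughly to running SENTINEL RESET against every sentinel pod in one pass, e.g. (sketch only, reusing the labels from the reproduction steps above; mymaster is the monitored master name also used in this thread):
# sketch: reset every sentinel for the monitored master "mymaster" in one pass
kubectl get pods -l app.kubernetes.io/component=sentinel -o name | \
  xargs -I {} -n 1 kubectl exec {} -c sentinel -- redis-cli -p 26379 sentinel reset mymaster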
Why would it exit? It only returns when there was an error while setting the oldest as master.
sorry, you're right.
Maybe we should change the sentinel's config during SetOldestAsMaster.
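In redis-cli terms that would mean repointing every sentinel at the pod the operator just promoted, roughly like the manual commands used further down in this thread (sketch only; NEW_MASTER is a placeholder for the IP chosen by SetOldestAsMaster):
# sketch: repoint all sentinels at the newly promoted pod (NEW_MASTER is a placeholder)
NEW_MASTER=10.244.0.42
kubectl get pods -l app.kubernetes.io/component=sentinel -o name | while read -r POD; do
  kubectl exec "${POD}" -c sentinel -- redis-cli -p 26379 sentinel remove mymaster
  kubectl exec "${POD}" -c sentinel -- redis-cli -p 26379 sentinel monitor mymaster "${NEW_MASTER}" 6379 2
done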
@tparsa I tried with 1.2.2 and it indeed fixed the issue, which I reproduced using the following steps:
# select a master
MASTER=$(kubectl get pods -l redisfailovers-role=master -ojson | jq ' .items[1] | .status.podIP ' | tr -d '"')
MASTER_NAME=$(kubectl get pods -l redisfailovers-role=master -ojson | jq ' .items[1] | .metadata.name ' | tr -d '"')
# make all redis pods slave of / monitor this master
kubectl get pods -l app.kubernetes.io/component=sentinel | awk '{print $1}' | grep -v NAME | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it pod/{} -c sentinel -- redis-cli -p 26379 sentinel remove mymaster"
kubectl get pods -l app.kubernetes.io/component=sentinel | awk '{print $1}' | grep -v NAME | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it pod/{} -c sentinel -- redis-cli -p 26379 sentinel monitor mymaster ${MASTER} 6379 2"
kubectl exec -it pod/${MASTER_NAME} -c redis -- redis-cli -p 6379 slaveof no one
kubectl get pods -l app.kubernetes.io/component=redis | awk '{print $1}' | grep -v NAME | grep -v ${MASTER_NAME} | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it pod/{} -c redis -- redis-cli -p 6379 slaveof ${MASTER} 6379"
Maybe the problem is that I'm not able to reproduce the issue in #550 (what I did is not the same), and therefore not able to get to the root cause.
Could you please help with steps to reproduce the issue?
We couldn't reproduce the issue manually, and we didn't have the time either. Maybe the problem arises when we have a lot of RedisFailovers and processing them takes more time than the operator's requested CPU allows. I'm not sure, but I think we had more occurrences with Redis instances that had many connections.
Okay, I think we should look for more novel solutions; @samof76 has some ideas. Closing this PR for now.
The most consistent way of reproducing it that worked for me is to make the operator cause it itself by restarting everything (for example, due to a change to the default configuration). It happened every time we upgraded from v1.0 to a more recent version.
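For example, one hedged way to make the operator roll everything itself is to change any field it reconciles into the redis statefulset, such as a resource limit on one of the RedisFailover objects defined above (the value below is arbitrary):
# sketch: nudge a reconciled field so the operator restarts the redis pods
kubectl patch redisfailover redisfailover1 --type merge \
  -p '{"spec":{"redis":{"resources":{"limits":{"memory":"600Mi"}}}}}'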
Fix redisfailover instances replicating unknown master(s)
Fixes #550
Changes proposed in this PR: