spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

Check if redis instances are replicating from unknown master (an IP no… #563

Closed. raghu-nandan-bs closed this 1 year ago

raghu-nandan-bs commented 1 year ago

Fix redisfailover instances replicating from unknown master(s)

Fixes #550

Changes proposed on the PR:

raghu-nandan-bs commented 1 year ago

Test cases and mocks need updating; I will do that once we confirm that this change fixes the issue.

@zekena2 @tparsa can you please try building an image from this branch and test whether it fixes the issue you are facing? I tried reproducing the issue, and this change was able to fix it.
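
To verify the state after loading the image, one option is to compare what each sentinel reports as its master against the redis pod IPs; this is only an illustrative check, assuming the default component labels and the mymaster monitor name used by the operator:

# master address according to each sentinel pod, across both failovers
kubectl get pods -l app.kubernetes.io/component=sentinel -o name | \
  xargs -I {} sh -c 'echo {}; kubectl exec {} -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster'

# pod IPs, to see which failover each reported master belongs to
kubectl get pods -l app.kubernetes.io/component=redis -o wide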

raghu-nandan-bs commented 1 year ago

Steps to reproduce the issue

Create an ephemeral cluster

kind create cluster --name 550

Install necessary components

Install CRD
kubectl create -f manifests/databases.spotahome.com_redisfailovers.yaml
Create redisfailovers
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redisfailover1
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
---
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redisfailover2
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
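
Assuming both manifests above are saved to a single file, say redisfailovers.yaml (the file name is illustrative), apply them with:

kubectl apply -f redisfailovers.yaml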
Build and load operator image
# from redis-operator repo's folder
make image
kind load docker-image  redis-operator --name 550
Install redis operator

Apply the following config:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: redisoperator
  name: redisoperator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redisoperator
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: redisoperator
    spec:
      serviceAccountName: redisoperator
      containers:
        - image: redis-operator:latest
          args:
          - --log-level=debug
          imagePullPolicy: Never
          name: app
          securityContext:
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          resources:
            limits:
              cpu: 100m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
      restartPolicy: Always
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: redisoperator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: redisoperator
subjects:
  - kind: ServiceAccount
    name: redisoperator
    namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: redisoperator
rules:
  - apiGroups:
      - databases.spotahome.com
    resources:
      - redisfailovers
      - redisfailovers/finalizers
    verbs:
      - "*"
  - apiGroups:
      - apiextensions.k8s.io
    resources:
      - customresourcedefinitions
    verbs:
      - "*"
  - apiGroups:
      - ""
    resources:
      - pods
      - services
      - endpoints
      - events
      - configmaps
      - persistentvolumeclaims
      - persistentvolumeclaims/finalizers
    verbs:
      - "*"
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - "get"
  - apiGroups:
      - apps
    resources:
      - deployments
      - statefulsets
    verbs:
      - "*"
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - "*"
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - "*"

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: redisoperator
---

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: http
    prometheus.io/scrape: "true"
  name: redisoperator
  labels:
    app: redisoperator
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9710
    protocol: TCP
    targetPort: metrics
  selector:
    app: redisoperator
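
Assuming the manifest above is saved as operator.yaml (illustrative name), apply it and wait for the operator to come up:

kubectl apply -f operator.yaml
kubectl rollout status deployment/redisoperator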

Make all redis instances follow a single master

# select a master 
MASTER=$(kubectl get pods -l redisfailovers-role=master -ojson | jq ' .items[1] | .status.podIP ' | tr -d '"')

# make all redis pods slave of this master
kubectl get pods -l app.kubernetes.io/component=redis  | awk '{print $1}' | grep -v NAME | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it  pod/{} -c redis -- redis-cli -p 6379 slaveof ${MASTER} 6379"

# following this, a new failover will be triggered by the sentinels and a new master will be chosen for all instances across both redis failovers
# repeat the above command if the issue isn't reproduced
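
To confirm the reproduced (broken) state, one can check which master each redis instance is actually replicating from; this is only an illustration, assuming the default component labels:

# role and master_host of every redis pod, across both failovers
kubectl get pods -l app.kubernetes.io/component=redis -o name | \
  xargs -I {} sh -c 'echo {}; kubectl exec {} -c redis -- redis-cli -p 6379 info replication | grep -E "^(role|master_host)"'

# compare the master_host values against the pod IPs of each failover
kubectl get pods -l app.kubernetes.io/component=redis -o wide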
tparsa commented 1 year ago

I don't think this is going to work. We were using version 1.2.2, which could set the master manually in this line and call restoreSentinels in this line, since the number of slaves is not the expected value (it is at least one more than expected).

raghu-nandan-bs commented 1 year ago

I don't think this is going to work. We were using version 1.2.2, which could set the master manually in this line and call restoreSentinels in this line, since the number of slaves is not the expected value (it is at least one more than expected).

The piece of code where RestoreSentinel is called was not being reached in this flow, I think... The checkandheal loop would simply exit after setting the oldest as master. I think what follows is: the sentinels would promptly make these redis instances follow the same old master again, and it forms a cycle. Resetting the sentinels simultaneously will hopefully break this cycle.
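
The manual equivalent of breaking that cycle is to reset all sentinels at roughly the same time so they re-discover the topology from the current master; a rough sketch, assuming the default labels and the mymaster monitor name:

# reset every sentinel so it drops stale master/replica state and re-discovers the topology
kubectl get pods -l app.kubernetes.io/component=sentinel -o name | \
  xargs -I {} kubectl exec {} -c sentinel -- redis-cli -p 26379 sentinel reset mymaster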

tparsa commented 1 year ago

Why would it exit? It only returns when there is an error while setting the oldest as master.

raghu-nandan-bs commented 1 year ago

Why would it exit? It only returns when there is an error while setting the oldest as master.

sorry, you're right.

tparsa commented 1 year ago

Maybe we should change the sentinel's config during SetOldestAsMaster.

raghu-nandan-bs commented 1 year ago

@tparsa I tried with 1.2.2 and it indeed fixed the issue, which I reproduced using the following steps:

# select a master 
MASTER=$(k get pods -l redisfailovers-role=master -ojson | jq ' .items[1] | .status.podIP ' | tr -d '"')
MASTER_NAME=$(k get pods -l redisfailovers-role=master -ojson | jq ' .items[1] | .metadata.name ' | tr -d '"')

# make all redis pods slave of / monitor this master
k get pods -l app.kubernetes.io/component=sentinel | awk '{print $1}' | grep -v NAME | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it  pod/{} -c sentinel -- redis-cli -p 26379 sentinel remove mymaster" 
k get pods -l app.kubernetes.io/component=sentinel | awk '{print $1}' | grep -v NAME | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it  pod/{} -c sentinel -- redis-cli -p 26379 sentinel monitor mymaster ${MASTER} 6379 2"
kubectl exec -it  pod/${MASTER_NAME} -c redis -- redis-cli -p 6379 slaveof no one
k get pods -l app.kubernetes.io/component=redis  | awk '{print $1}' | grep -v NAME | grep -v ${MASTER_NAME} | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it  pod/{} -c redis -- redis-cli -p 6379 slaveof ${MASTER} 6379"

Maybe the problem is that I'm not able to reproduce the exact issue in #550 (what I did is not the same), and therefore not able to get to the root cause.

could you please help with steps to reproduce the issue?

tparsa commented 1 year ago

We couldn't reproduce the issue manually, and we didn't have the time either. Maybe the problem arises when we have a lot of RedisFailovers and processing them takes more CPU than the operator has requested. I'm not sure, but I guess we had more occurrences with Redis instances that had many connections.
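
One way to check whether the operator is CPU-starved while reconciling (this is only a guess at the cause, and the commands are illustrative) is to compare its actual usage against its configured requests/limits:

# actual CPU/memory usage of the operator pod (requires metrics-server)
kubectl top pod -l app=redisoperator

# requests/limits it was deployed with
kubectl get deployment redisoperator -o jsonpath='{.spec.template.spec.containers[0].resources}'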

raghu-nandan-bs commented 1 year ago

We couldn't reproduce the issue manually, and we didn't have the time either. Maybe the problem arises when we have a lot of RedisFailovers and processing them takes more CPU than the operator has requested. I'm not sure, but I guess we had more occurrences with Redis instances that had many connections.

Okay, I think we should look for more novel solutions; @samof76 has some ideas. Closing this PR for now.

zekena2 commented 1 year ago

The most consistent way of reproducing it that worked for me is to make the operator cause it itself by restarting everything (for example, due to a change to the default configuration). It was happening every time we were upgrading from v1.0 to a more recent version.
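
A way to simulate that operator-driven restart in the test cluster above (illustrative only, and assuming the operator's default rfr-/rfs- naming for the generated workloads) is to restart the redis and sentinel workloads of both failovers and let the operator heal them:

# force a restart of the redis statefulsets and sentinel deployments for both failovers
kubectl rollout restart statefulset/rfr-redisfailover1 statefulset/rfr-redisfailover2
kubectl rollout restart deployment/rfs-redisfailover1 deployment/rfs-redisfailover2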