rabbitmq / cluster-operator

RabbitMQ Cluster Kubernetes Operator
https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
Mozilla Public License 2.0

Problem with overriding statefulset readiness probe #1698

Open bkelava opened 1 month ago

bkelava commented 1 month ago

Describe the bug

Overriding the StatefulSet readiness probe from tcpSocket to exec leaves the default tcpSocket handler in the resulting config alongside the new exec handler.

To Reproduce

kubectl apply -f cluster-test.yml

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-test
spec:
  replicas: 5
  image: 172.17.12.132:9110/rabbitmq/rabbitmq:3.13.4-management
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi
  persistence:
    storageClassName: nfs-rabbitmq-test-storage
    storage: "10Gi"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - servisi-0023
            - servisi-0024
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                livenessProbe:
                  exec:
                    command:
                      - rabbitmq-diagnostics
                      - status
                  initialDelaySeconds: 60
                  periodSeconds: 60
                  timeoutSeconds: 15
                readinessProbe:
                  exec:
                    command:
                    - rabbitmq-diagnostics
                    - ping
                  initialDelaySeconds: 20
                  periodSeconds: 60
                  timeoutSeconds: 10
                securityContext:
                  allowPrivilegeEscalation: false
                  capabilities:
                    add:
                      - CHOWN
                  privileged: false
                  procMount: Default
                  readOnlyRootFilesystem: false
                  runAsNonRoot: false
                  runAsUser: 999
                  runAsGroup: 100
                volumeMounts:
                  - name: definitions-json
                    mountPath: /etc/rabbitmq/definitions.json
                    subPath: definitions.json
                  - name: rabbitmq-conf
                    mountPath: /etc/rabbitmq/rabbitmq.conf
                    subPath: rabbitmq.conf
                      #- name: rabbitmq-data
                      #mountPath: /var/lib/rabbitmq
            securityContext:
              fsGroup: 100
              runAsNonRoot: true
              runAsUser: 999
              runAsGroup: 100
            volumes:
              - name: definitions-json
                configMap:
                  name: rabbitmq-configmap
                  items:
                    - key: definitions.json
                      path: definitions.json
              - name: rabbitmq-conf
                configMap:
                  name: rabbitmq-configmap
                  items:
                    - key: rabbitmq.conf
                      path: rabbitmq.conf
                        #volumeClaimTemplates:
                        #- metadata:
                        #name: rabbitmq-data
                        #  annotations:
                        # volume.alpha.kubernetes.io/storage-class: nfs-rabbitmq-test-storage
                        #      spec:
                        #     accessModes:
                        #  - ReadWriteOnce
                        #  storageClassName: nfs-rabbitmq-test-storage
                        #  resources:
                        #       requests:
                        #        storage: 10Gi

kubectl get statefulset rabbitmq-test-server -o yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    rabbitmq.com/createdAt: "2024-08-13T09:42:31Z"
  creationTimestamp: "2024-08-13T09:42:31Z"
  generation: 1
  labels:
    app.kubernetes.io/component: rabbitmq
    app.kubernetes.io/name: rabbitmq-test
    app.kubernetes.io/part-of: rabbitmq
  name: rabbitmq-test-server
  namespace: rabbitmq-test
  ownerReferences:
  - apiVersion: rabbitmq.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: RabbitmqCluster
    name: rabbitmq-test
    uid: 073ca32b-3fb0-4c92-a0b5-b840c679e36a
  resourceVersion: "23728935"
  uid: 704acd08-39cd-4507-b731-9d4f66c1813c
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: rabbitmq-test
  serviceName: rabbitmq-test-nodes
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: rabbitmq
        app.kubernetes.io/name: rabbitmq-test
        app.kubernetes.io/part-of: rabbitmq
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - servisi-0023
                - servisi-0024
      automountServiceAccountToken: true
      containers:
      - env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: K8S_SERVICE_NAME
          value: rabbitmq-test-nodes
        - name: RABBITMQ_ENABLED_PLUGINS_FILE
          value: /operator/enabled_plugins
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_NODENAME
          value: rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)
        - name: K8S_HOSTNAME_SUFFIX
          value: .$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)
        image: 172.17.12.132:9110/rabbitmq/rabbitmq:3.13.4-management
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0;
                fi; rabbitmq-upgrade await_online_quorum_plus_one -t 604800 && rabbitmq-upgrade
                await_online_synchronized_mirror -t 604800 && rabbitmq-upgrade drain
                -t 604800
        livenessProbe:
          exec:
            command:
            - rabbitmq-diagnostics
            - status
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 15
        name: rabbitmq
        ports:
        - containerPort: 4369
          name: epmd
          protocol: TCP
        - containerPort: 5672
          name: amqp
          protocol: TCP
        - containerPort: 15672
          name: management
          protocol: TCP
        - containerPort: 15692
          name: prometheus
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - rabbitmq-diagnostics
            - ping
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 60
          successThreshold: 1
          tcpSocket:
            port: amqp
          timeoutSeconds: 10
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - CHOWN
          privileged: false
          procMount: Default
          readOnlyRootFilesystem: false
          runAsGroup: 100
          runAsNonRoot: false
          runAsUser: 999
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/rabbitmq/
          name: rabbitmq-erlang-cookie
        - mountPath: /var/lib/rabbitmq/mnesia/
          name: persistence
        - mountPath: /etc/rabbitmq/definitions.json
          name: definitions-json
          subPath: definitions.json
        - mountPath: /etc/rabbitmq/rabbitmq.conf
          name: rabbitmq-conf
          subPath: rabbitmq.conf
        - mountPath: /operator
          name: rabbitmq-plugins
        - mountPath: /etc/rabbitmq/conf.d/10-operatorDefaults.conf
          name: rabbitmq-confd
          subPath: operatorDefaults.conf
        - mountPath: /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf
          name: rabbitmq-confd
          subPath: userDefinedConfiguration.conf
        - mountPath: /etc/pod-info/
          name: pod-info
        - mountPath: /etc/rabbitmq/conf.d/11-default_user.conf
          name: rabbitmq-confd
          subPath: default_user.conf
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sh
        - -c
        - cp /tmp/erlang-cookie-secret/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
          && chmod 600 /var/lib/rabbitmq/.erlang.cookie ; cp /tmp/rabbitmq-plugins/enabled_plugins
          /operator/enabled_plugins ; echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf
          && sed -e 's/default_user/username/' -e 's/default_pass/password/' /tmp/default_user.conf
          >> /var/lib/rabbitmq/.rabbitmqadmin.conf && chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf
          ; sleep 30
        image: 172.17.12.132:9110/rabbitmq/rabbitmq:3.13.4-management
        imagePullPolicy: IfNotPresent
        name: setup-container
        resources:
          limits:
            cpu: 100m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp/rabbitmq-plugins/
          name: plugins-conf
        - mountPath: /var/lib/rabbitmq/
          name: rabbitmq-erlang-cookie
        - mountPath: /tmp/erlang-cookie-secret/
          name: erlang-cookie-secret
        - mountPath: /operator
          name: rabbitmq-plugins
        - mountPath: /var/lib/rabbitmq/mnesia/
          name: persistence
        - mountPath: /tmp/default_user.conf
          name: rabbitmq-confd
          subPath: default_user.conf
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 100
        runAsGroup: 100
        runAsNonRoot: true
        runAsUser: 999
      serviceAccount: rabbitmq-test-server
      serviceAccountName: rabbitmq-test-server
      terminationGracePeriodSeconds: 604800
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: rabbitmq-test
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: definitions.json
            path: definitions.json
          name: rabbitmq-configmap
        name: definitions-json
      - configMap:
          defaultMode: 420
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          name: rabbitmq-configmap
        name: rabbitmq-conf
      - configMap:
          defaultMode: 420
          name: rabbitmq-test-plugins-conf
        name: plugins-conf
      - name: rabbitmq-confd
        projected:
          defaultMode: 420
          sources:
          - configMap:
              items:
              - key: operatorDefaults.conf
                path: operatorDefaults.conf
              - key: userDefinedConfiguration.conf
                path: userDefinedConfiguration.conf
              name: rabbitmq-test-server-conf
          - secret:
              items:
              - key: default_user.conf
                path: default_user.conf
              name: rabbitmq-test-default-user
      - emptyDir: {}
        name: rabbitmq-erlang-cookie
      - name: erlang-cookie-secret
        secret:
          defaultMode: 420
          secretName: rabbitmq-test-erlang-cookie
      - emptyDir: {}
        name: rabbitmq-plugins
      - downwardAPI:
          defaultMode: 420
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['skipPreStopChecks']
            path: skipPreStopChecks
        name: pod-info
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: rabbitmq
        app.kubernetes.io/name: rabbitmq-test
        app.kubernetes.io/part-of: rabbitmq
      name: persistence
      namespace: rabbitmq-test
      ownerReferences:
      - apiVersion: rabbitmq.com/v1beta1
        blockOwnerDeletion: false
        controller: true
        kind: RabbitmqCluster
        name: rabbitmq-test
        uid: 073ca32b-3fb0-4c92-a0b5-b840c679e36a
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: nfs-rabbitmq-test-storage
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 0
  collisionCount: 0
  currentRevision: rabbitmq-test-server-5b4fd5484d
  observedGeneration: 1
  replicas: 0
  updateRevision: rabbitmq-test-server-5b4fd5484d

The StatefulSet does not replace the readiness probe; it keeps both the exec and tcpSocket handlers, as follows:

        readinessProbe:
          exec:
            command:
            - rabbitmq-diagnostics
            - ping
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 60
          successThreshold: 1
          tcpSocket:
            port: amqp
          timeoutSeconds: 10

which results in this error:

Events:
  Type     Reason            Age                    From                    Message
  ----     ------            ----                   ----                    -------
  Normal   SuccessfulCreate  9m31s                  statefulset-controller  create Claim persistence-rabbitmq-test-server-0 Pod rabbitmq-test-server-0 in StatefulSet rabbitmq-test-server success
  Warning  FailedCreate      4m4s (x17 over 9m31s)  statefulset-controller  create Pod rabbitmq-test-server-0 in StatefulSet rabbitmq-test-server failed error: Pod "rabbitmq-test-server-0" is invalid: spec.containers[0].readinessProbe.tcpSocket: Forbidden: may not specify more than 1 handler type
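
The duplicated handler is what a plain key-by-key merge of the override onto the operator's default probe would produce. The sketch below only illustrates that behaviour in Python and is not the operator's actual code: `merge_maps` is a hypothetical helper, and the contents of `default_probe` are an assumption inferred from the StatefulSet output above.

```python
from copy import deepcopy


def merge_maps(base, override):
    """Recursively merge override into a copy of base (hypothetical helper).

    Keys present only in base are kept, which is why an existing
    handler such as tcpSocket survives the merge.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_maps(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Assumed operator default: a tcpSocket check against the amqp port.
default_probe = {
    "tcpSocket": {"port": "amqp"},
    "failureThreshold": 3,
    "successThreshold": 1,
}

# The readinessProbe from spec.override in cluster-test.yml.
override_probe = {
    "exec": {"command": ["rabbitmq-diagnostics", "ping"]},
    "initialDelaySeconds": 20,
    "periodSeconds": 60,
    "timeoutSeconds": 10,
}

merged_probe = merge_maps(default_probe, override_probe)

# Both handler keys are present after the merge.
print([k for k in ("exec", "tcpSocket") if k in merged_probe])
```

Under this kind of merge the override can add or change fields, but it cannot drop a key that the default already sets, so the resulting probe carries two handlers and the Pod is rejected.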

Patching the StatefulSet directly is an option, but it is not ideal. Please help.
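
For anyone checking a rendered StatefulSet before the Pods are created, the rule behind the "may not specify more than 1 handler type" message can be approximated with a small check. This is a minimal sketch, assuming the probe has already been loaded into a Python dict (for example from the `kubectl get statefulset ... -o yaml` output); `probe_handlers` and `check_probe` are hypothetical names, not part of any Kubernetes client library.

```python
# Handler fields a Kubernetes probe can carry; only one may be set.
HANDLER_KEYS = ("exec", "httpGet", "tcpSocket", "grpc")


def probe_handlers(probe):
    """Return the handler types that are set on a probe dict."""
    return [key for key in HANDLER_KEYS if probe.get(key) is not None]


def check_probe(probe):
    """Return an error string if the probe is invalid, otherwise None."""
    handlers = probe_handlers(probe)
    if len(handlers) > 1:
        return "more than one handler type: " + ", ".join(handlers)
    if not handlers:
        return "no handler type specified"
    return None


# The readinessProbe as it appears in the generated StatefulSet.
rendered_probe = {
    "exec": {"command": ["rabbitmq-diagnostics", "ping"]},
    "tcpSocket": {"port": "amqp"},
    "failureThreshold": 3,
    "initialDelaySeconds": 20,
    "periodSeconds": 60,
    "successThreshold": 1,
    "timeoutSeconds": 10,
}

print(check_probe(rendered_probe))
```

A probe with only the `exec` handler passes the same check, which matches the liveness probe in the output above.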

mkuratczyk commented 1 month ago

While allowing the probe to be overridden is something we can consider, can you explain what you are trying to accomplish here? Why do you expect rabbitmq-diagnostics ping to be a better readiness probe? In which situations would it be better?

sudhirjena commented 4 days ago

@mkuratczyk, we are facing the same issue when overriding readinessProbe.initialDelaySeconds. We are deploying RabbitMQ on an EKS + Fargate cluster, where the intrinsic scheduling takes about 100 seconds. With the default readinessProbe.initialDelaySeconds of 10s, we get this error every time the RabbitMQ pod is scheduled:

Readiness probe failed: dial tcp 10.35.177.155:5672: connect: connection refused

bkelava commented 4 days ago

@sudhirjena

I've temporarily worked around the error by commenting out the readinessProbe override, as follows:

...
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                livenessProbe:
                  exec:
                    command:
                      - rabbitmq-diagnostics
                      - status
                  initialDelaySeconds: 60
                  periodSeconds: 60
                  timeoutSeconds: 15
                # readinessProbe:
                #   tcpSocket:
                #     port: 22
                #   # exec:
                #   #   command:
                #   #   - rabbitmq-diagnostics
                #   #   - ping
                #   initialDelaySeconds: 20
                #   periodSeconds: 60
                #   timeoutSeconds: 10
                securityContext:
                  allowPrivilegeEscalation: false
                  capabilities:
                    add:
                      - CHOWN
                  privileged: false
                  procMount: Default
                  readOnlyRootFilesystem: false
                  runAsNonRoot: false
                  runAsUser: 999
                  runAsGroup: 100
...

The cluster started without errors:

NAME                         READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
pod/rabbitmq-test-server-0   1/1     Running   0          7d13h   10.33.128.2   servisi-0023   <none>           <none>
pod/rabbitmq-test-server-1   1/1     Running   0          7d13h   10.33.128.3   servisi-0023   <none>           <none>
pod/rabbitmq-test-server-2   1/1     Running   0          7d13h   10.35.128.3   servisi-0024   <none>           <none>
pod/rabbitmq-test-server-3   1/1     Running   0          7d13h   10.33.128.4   servisi-0023   <none>           <none>
pod/rabbitmq-test-server-4   1/1     Running   0          7d13h   10.35.128.2   servisi-0024   <none>           <none>

NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                        AGE   SELECTOR
service/rabbitmq-test         ClusterIP   10.245.98.250   <none>        5672/TCP,15672/TCP,15692/TCP   27d   app.kubernetes.io/name=rabbitmq-test
service/rabbitmq-test-nodes   ClusterIP   None            <none>        4369/TCP,25672/TCP             27d   app.kubernetes.io/name=rabbitmq-test

But as always, a temporary solution might become a permanent one 🥇

mkuratczyk commented 4 days ago

We are not against the idea, so PRs are welcome. This is an open source project; you don't have to wait for us to get around to implementing this.