mongodb / mongodb-kubernetes-operator

MongoDB Community Kubernetes Operator

Readiness probe failing continuously with operator 0.7.2 #912

Closed LoonyRules closed 1 year ago

LoonyRules commented 2 years ago

What did you do to encounter the bug? Steps to reproduce the behavior:

  1. Install a 2 member replicaset using the operator's MongoDBCommunity CRD.
  2. Wait for all pods to boot and for the replicaset to become healthy and accessible.

What did you expect? ReplicaSet to boot successfully and for readiness and liveness probes to complete successfully.

What happened instead? A readiness probe warning comes up, then disappears. A minute after both pods have booted and the service is accessible via MongoDB Compass, a Kubernetes event is triggered claiming the readiness probe timed out. The pod does not restart and the replica set is still healthy, but this made Kubernetes unhappy. Longhorn then throws a fit due to the error and puts the logs-volume in a read-only state.


Operator Information

Kubernetes Cluster Information

Additional context I found an issue that is very similar to what I am coming across (https://github.com/mongodb/mongodb-kubernetes-operator/issues/668), but increasing the readiness probe variables does nothing except push the failure further into the future. At some point, the readiness probe will run and will fail.

When accessing the pod shell of mongodb-0 or mongodb-1 (as they both fail), running /opt/scripts/readinessprobe returns a "no such file or directory" error, so I'm assuming the underlying issue could be that the readinessprobe script it's trying to run doesn't actually exist?

cat /var/log/mongodb-mms-automation/readiness.log returns a lot of:

{"level":"info","ts":1646397066.4455326,"msg":"Agent has reached goal state"}

which backs up my point that the replica is happy when this event occurs.

cat /healthstatus/agent-health-status.json results:

I have no name!@mongodb-1:/$ cat /healthstatus/agent-health-status.json
{"statuses":{"mongodb-1":{"IsInGoalState":true,"LastMongoUpTime":1646397416,"ExpectedToBeUp":true,"ReplicationStatus":1}},"mmsStatus":{"mongodb-1":{"name":"mongodb-1","lastGoalVersionAchieved":23,"plans":[{"started":"2022-03-04T12:03:45.602518511Z","completed":"2022-03-04T12:03:58.211444431Z","moves":[{"move":"
Start","moveDoc":"Start the process","steps":[{"step":"StartFresh","stepDoc":"Start a mongo instance  (start fresh)","isWaitStep":false,"started":"2022-03-04T12:03:45.602557342Z","completed":"2022-03-04T12:03:56.873578602Z","result":"success"}]},{"move":"WaitRsInit","moveDoc":"Wait for the replica set to be ini
tialized by another member","steps":[{"step":"WaitRsInit","stepDoc":"Wait for the replica set to be initialized by another member","isWaitStep":true,"started":"2022-03-04T12:03:56.873814106Z","completed":null,"result":"error"}]},{"move":"WaitFeatureCompatibilityVersionCorrect","moveDoc":"Wait for featureCompati
bilityVersion to be right","steps":[{"step":"WaitFeatureCompatibilityVersionCorrect","stepDoc":"Wait for featureCompatibilityVersion to be right","isWaitStep":true,"started":null,"completed":null,"result":""}]}]},{"started":"2022-03-04T12:04:28.797596949Z","completed":"2022-03-04T12:04:29.988272348Z","moves":[{
"move":"WaitHasCorrectAutomationCredentials","moveDoc":"Wait for the automation user to be added (if needed)","steps":[{"step":"WaitHasCorrectAutomationCredentials","stepDoc":"Wait for the automation user to be added (if needed)","isWaitStep":true,"started":"2022-03-04T12:04:28.797723502Z","completed":null,"res
ult":"wait"}]}]}],"errorCode":0,"errorString":""}}}
I have no name!@mongodb-1:/$


Readiness probe timeout event (note: this happens to both the mongodb-0 and mongodb-1 pods at almost the same time because that's when they booted):

apiVersion: v1
kind: Event
metadata:
  name: mongodb-0.16d92c00fe7c8b22
  namespace: mongodb
  uid: 2e1a6823-b714-431b-aa60-922ea36f23c8
  resourceVersion: '12756491950'
  creationTimestamp: '2022-03-04T12:05:21Z'
  managedFields:
    - manager: kubelet
      operation: Update
      apiVersion: v1
      time: '2022-03-04T12:05:21Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:count: {}
        f:firstTimestamp: {}
        f:involvedObject: {}
        f:lastTimestamp: {}
        f:message: {}
        f:reason: {}
        f:source:
          f:component: {}
          f:host: {}
        f:type: {}
  selfLink: /api/v1/namespaces/mongodb/events/mongodb-0.16d92c00fe7c8b22
involvedObject:
  kind: Pod
  namespace: mongodb
  name: mongodb-0
  uid: 4d9dec5c-b7af-4d3b-8941-ecad5af2aa71
  apiVersion: v1
  resourceVersion: '12756232422'
  fieldPath: spec.containers{mongodb-agent}
reason: Unhealthy
message: 'Readiness probe failed: command "/opt/scripts/readinessprobe" timed out'
source:
  component: kubelet
  host: scw-k8s-elated-menin-pool-mystifying-sh-b86a51
firstTimestamp: '2022-03-04T12:05:21Z'
lastTimestamp: '2022-03-04T12:17:00Z'
count: 6
type: Warning
eventTime: null
reportingComponent: ''
reportingInstance: ''

mongodb StatefulSet (created by the operator from the MongoDBCommunity resource):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
  namespace: mongodb
  uid: 9ed1f59d-e4b2-4136-8d24-7079d28cb0e2
  resourceVersion: '12756278468'
  generation: 32
  creationTimestamp: '2022-02-07T17:40:58Z'
  labels:
    app: mongodb
  ownerReferences:
    - apiVersion: mongodbcommunity.mongodb.com/v1
      kind: MongoDBCommunity
      name: mongodb
      uid: 005ed612-803f-450d-aa73-8ab9a50e1e97
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: node-fetch
      operation: Update
      apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:replicas: {}
      subresource: scale
    - manager: node-fetch
      operation: Update
      apiVersion: apps/v1
      time: '2022-03-04T11:41:12Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:template:
            f:metadata:
              f:annotations:
                .: {}
                f:prometheus.io/path: {}
                f:prometheus.io/port: {}
                f:prometheus.io/scrape: {}
            f:spec:
              f:containers:
                k:{"name":"mongodb-agent"}:
                  f:readinessProbe:
                    f:timeoutSeconds: {}
    - manager: manager
      operation: Update
      apiVersion: apps/v1
      time: '2022-03-04T11:41:13Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"005ed612-803f-450d-aa73-8ab9a50e1e97"}: {}
        f:spec:
          f:podManagementPolicy: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:serviceName: {}
          f:template:
            f:metadata:
              f:labels:
                .: {}
                f:app: {}
            f:spec:
              f:containers:
                k:{"name":"mongod"}:
                  .: {}
                  f:command: {}
                  f:env:
                    .: {}
                    k:{"name":"AGENT_STATUS_FILEPATH"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:resources:
                    .: {}
                    f:limits:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    .: {}
                    f:runAsNonRoot: {}
                    f:runAsUser: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/data"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/healthstatus"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/hooks"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/var/lib/mongodb-mms-automation/authentication"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/var/log/mongodb-mms-automation"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                k:{"name":"mongodb-agent"}:
                  .: {}
                  f:command: {}
                  f:env:
                    .: {}
                    k:{"name":"AGENT_STATUS_FILEPATH"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"AUTOMATION_CONFIG_MAP"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"HEADLESS_AGENT"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"POD_NAMESPACE"}:
                      .: {}
                      f:name: {}
                      f:valueFrom:
                        .: {}
                        f:fieldRef: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:readinessProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:initialDelaySeconds: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                  f:resources:
                    .: {}
                    f:limits:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    .: {}
                    f:runAsNonRoot: {}
                    f:runAsUser: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/data"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/opt/scripts"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/var/lib/automation/config"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                      f:readOnly: {}
                    k:{"mountPath":"/var/lib/mongodb-mms-automation/authentication"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/var/log/mongodb-mms-automation"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                    k:{"mountPath":"/var/log/mongodb-mms-automation/healthstatus"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                k:{"name":"mongodb-exporter"}:
                  .: {}
                  f:args: {}
                  f:command: {}
                  f:env:
                    .: {}
                    k:{"name":"DBPASS"}:
                      .: {}
                      f:name: {}
                      f:valueFrom:
                        .: {}
                        f:secretKeyRef: {}
                    k:{"name":"DBURL"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                    k:{"name":"DBUSER"}:
                      .: {}
                      f:name: {}
                      f:value: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:ports:
                    .: {}
                    k:{"containerPort":9216,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:name: {}
                      f:protocol: {}
                  f:resources:
                    .: {}
                    f:limits:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    .: {}
                    f:allowPrivilegeEscalation: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
              f:dnsPolicy: {}
              f:initContainers:
                .: {}
                k:{"name":"mongod-posthook"}:
                  .: {}
                  f:command: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:resources: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/hooks"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
                k:{"name":"mongodb-agent-readinessprobe"}:
                  .: {}
                  f:command: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:resources: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/opt/scripts"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
              f:restartPolicy: {}
              f:schedulerName: {}
              f:securityContext:
                .: {}
                f:fsGroup: {}
              f:serviceAccount: {}
              f:serviceAccountName: {}
              f:terminationGracePeriodSeconds: {}
              f:volumes:
                .: {}
                k:{"name":"agent-scripts"}:
                  .: {}
                  f:emptyDir: {}
                  f:name: {}
                k:{"name":"automation-config"}:
                  .: {}
                  f:name: {}
                  f:secret:
                    .: {}
                    f:defaultMode: {}
                    f:secretName: {}
                k:{"name":"healthstatus"}:
                  .: {}
                  f:emptyDir: {}
                  f:name: {}
                k:{"name":"hooks"}:
                  .: {}
                  f:emptyDir: {}
                  f:name: {}
                k:{"name":"mongodb-keyfile"}:
                  .: {}
                  f:emptyDir: {}
                  f:name: {}
          f:updateStrategy:
            f:type: {}
          f:volumeClaimTemplates: {}
    - manager: kubectl-rollout
      operation: Update
      apiVersion: apps/v1
      time: '2022-03-04T12:02:53Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:template:
            f:metadata:
              f:annotations:
                f:kubectl.kubernetes.io/restartedAt: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: apps/v1
      time: '2022-03-04T12:06:48Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:availableReplicas: {}
          f:collisionCount: {}
          f:currentReplicas: {}
          f:currentRevision: {}
          f:observedGeneration: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updateRevision: {}
          f:updatedReplicas: {}
      subresource: status
  selfLink: /apis/apps/v1/namespaces/mongodb/statefulsets/mongodb
status:
  observedGeneration: 32
  replicas: 2
  readyReplicas: 2
  currentReplicas: 2
  updatedReplicas: 2
  currentRevision: mongodb-5ff65685fb
  updateRevision: mongodb-5ff65685fb
  collisionCount: 0
  availableReplicas: 2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: mongodb
      annotations:
        kubectl.kubernetes.io/restartedAt: '2022-03-04T12:02:53Z'
        prometheus.io/path: /metrics
        prometheus.io/port: '9216'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: agent-scripts
          emptyDir: {}
        - name: automation-config
          secret:
            secretName: mongodb-config
            defaultMode: 416
        - name: healthstatus
          emptyDir: {}
        - name: hooks
          emptyDir: {}
        - name: mongodb-keyfile
          emptyDir: {}
      initContainers:
        - name: mongod-posthook
          image: >-
            quay.io/mongodb/mongodb-kubernetes-operator-version-upgrade-post-start-hook:1.0.3
          command:
            - cp
            - version-upgrade-hook
            - /hooks/version-upgrade
          resources: {}
          volumeMounts:
            - name: hooks
              mountPath: /hooks
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
        - name: mongodb-agent-readinessprobe
          image: quay.io/mongodb/mongodb-kubernetes-readinessprobe:1.0.6
          command:
            - cp
            - /probes/readinessprobe
            - /opt/scripts/readinessprobe
          resources: {}
          volumeMounts:
            - name: agent-scripts
              mountPath: /opt/scripts
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      containers:
        - name: mongod
          image: docker.io/mongo:5.0.6
          command:
            - /bin/sh
            - '-c'
            - >+

              #run post-start hook to handle version changes

              /hooks/version-upgrade

              # wait for config and keyfile to be created by the agent
               while ! [ -f /data/automation-mongod.conf -a -f /var/lib/mongodb-mms-automation/authentication/keyfile ]; do sleep 3 ; done ; sleep 2 ;

              # with mongod configured to append logs, we need to provide them
              to stdout as

              # mongod does not write to stdout and a log file

              tail -F /var/log/mongodb-mms-automation/mongodb.log > /dev/stdout
              &

              # start mongod with this configuration

              exec mongod -f /data/automation-mongod.conf;

          env:
            - name: AGENT_STATUS_FILEPATH
              value: /healthstatus/agent-health-status.json
          resources:
            limits:
              cpu: '4'
              memory: 7Gi
            requests:
              cpu: '2'
              memory: 2560Mi
          volumeMounts:
            - name: data-volume
              mountPath: /data
            - name: healthstatus
              mountPath: /healthstatus
            - name: hooks
              mountPath: /hooks
            - name: logs-volume
              mountPath: /var/log/mongodb-mms-automation
            - name: mongodb-keyfile
              mountPath: /var/lib/mongodb-mms-automation/authentication
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            runAsUser: 2000
            runAsNonRoot: true
        - name: mongodb-agent
          image: quay.io/mongodb/mongodb-agent:11.0.5.6963-1
          command:
            - /bin/bash
            - '-c'
            - >-
              current_uid=$(id -u)

              AGENT_API_KEY="$(cat
              /mongodb-automation/agent-api-key/agentApiKey)"

              declare -r current_uid

              if ! grep -q "${current_uid}" /etc/passwd ; then

              sed -e "s/^mongodb:/builder:/" /etc/passwd > /tmp/passwd

              echo "mongodb:x:$(id -u):$(id -g):,,,:/:/bin/bash" >> /tmp/passwd

              export NSS_WRAPPER_PASSWD=/tmp/passwd

              export LD_PRELOAD=libnss_wrapper.so

              export NSS_WRAPPER_GROUP=/etc/group

              fi

              agent/mongodb-agent
              -cluster=/var/lib/automation/config/cluster-config.json
              -healthCheckFilePath=/var/log/mongodb-mms-automation/healthstatus/agent-health-status.json
              -serveStatusPort=5000 -skipMongoStart -noDaemonize
              -useLocalMongoDbTools
          env:
            - name: AGENT_STATUS_FILEPATH
              value: >-
                /var/log/mongodb-mms-automation/healthstatus/agent-health-status.json
            - name: AUTOMATION_CONFIG_MAP
              value: mongodb-config
            - name: HEADLESS_AGENT
              value: 'true'
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: '1'
              memory: 500M
            requests:
              cpu: 500m
              memory: 400M
          volumeMounts:
            - name: agent-scripts
              mountPath: /opt/scripts
            - name: automation-config
              readOnly: true
              mountPath: /var/lib/automation/config
            - name: data-volume
              mountPath: /data
            - name: healthstatus
              mountPath: /var/log/mongodb-mms-automation/healthstatus
            - name: logs-volume
              mountPath: /var/log/mongodb-mms-automation
            - name: mongodb-keyfile
              mountPath: /var/lib/mongodb-mms-automation/authentication
          readinessProbe:
            exec:
              command:
                - /opt/scripts/readinessprobe
            initialDelaySeconds: 5
            timeoutSeconds: 3
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 40
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            runAsUser: 2000
            runAsNonRoot: true
        - name: mongodb-exporter
          image: bitnami/mongodb-exporter:0.20.6
          command:
            - /bin/sh
          args:
            - '-c'
            - >-
              /opt/bitnami/mongodb-exporter/bin/mongodb_exporter
              --discovering-mode --mongodb.indexstats-colls=someDBs
              --mongodb.collstats-colls=someDBs --web.listen-address=":9216"
              --web.telemetry-path="/metrics" --mongodb.direct-connect=false
              --mongodb.uri="mongodb://${DBUSER}:${DBPASS}@${DBURL}/admin?ssl=false"
          ports:
            - name: metrics
              containerPort: 9216
              protocol: TCP
          env:
            - name: DBPASS
              valueFrom:
                secretKeyRef:
                  name: mongodb-user-password
                  key: password
            - name: DBURL
              value: localhost:27017
            - name: DBUSER
              value: mongodb
          resources:
            limits:
              cpu: 500m
              memory: 800M
            requests:
              cpu: 500m
              memory: 500M
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: mongodb-database
      serviceAccount: mongodb-database
      securityContext:
        fsGroup: 2000
      schedulerName: default-scheduler
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: data-volume
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: longhorn
        volumeMode: Filesystem
      status:
        phase: Pending
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: logs-volume
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2G
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: mongodb
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
  revisionHistoryLimit: 10

The database pods' mongod container just prints this every now and then, nothing else:

{"t":{"$date":"2022-03-04T12:29:33.133+00:00"},"s":"I",  "c":"SHARDING", "id":20997,   "ctx":"conn16","msg":"Refreshed RWC defaults","attr":{"newDefaults":{}}}
{"t":{"$date":"2022-03-04T12:29:33.147+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"100.64.3.127:60424","uuid":"1363e560-0c06-4425-a580-287d86391fed","connectionId":285,"connectionCount":13}}
{"t":{"$date":"2022-03-04T12:29:33.147+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn285","msg":"client metadata","attr":{"remote":"100.64.3.127:60424","client":"conn285","doc":{"driver":{"name":"mongo-go-driver","version":"v1.3.8+prerelease"},"os":{"type":"linux","architecture":"amd64"},"platform":"go1.15.9","application":{"name":"MongoDB Automation Agent v11.0.5.6963 (git: 65ba9b52de741e00e747a80f1a93442f9fe40695)"}}}}
{"t":{"$date":"2022-03-04T12:29:33.149+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"100.64.3.127:60426","uuid":"667aa6cc-3ea7-4e68-84f6-03ce2e71fcae","connectionId":286,"connectionCount":14}}
{"t":{"$date":"2022-03-04T12:29:33.149+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn286","msg":"client metadata","attr":{"remote":"100.64.3.127:60426","client":"conn286","doc":{"driver":{"name":"mongo-go-driver","version":"v1.3.8+prerelease"},"os":{"type":"linux","architecture":"amd64"},"platform":"go1.15.9","application":{"name":"MongoDB Automation Agent v11.0.5.6963 (git: 65ba9b52de741e00e747a80f1a93442f9fe40695)"}}}}
{"t":{"$date":"2022-03-04T12:29:33.177+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn286","msg":"Authentication succeeded","attr":{"mechanism":"SCRAM-SHA-256","speculative":false,"principalName":"__system","authenticationDatabase":"local","remote":"100.64.3.127:60426","extraInfo":{}}}
{"t":{"$date":"2022-03-04T12:29:33.178+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn286","msg":"Connection ended","attr":{"remote":"100.64.3.127:60426","uuid":"667aa6cc-3ea7-4e68-84f6-03ce2e71fcae","connectionId":286,"connectionCount":13}}
{"t":{"$date":"2022-03-04T12:29:33.178+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn285","msg":"Connection ended","attr":{"remote":"100.64.3.127:60424","uuid":"1363e560-0c06-4425-a580-287d86391fed","connectionId":285,"connectionCount":12}}
philip-nicholls commented 2 years ago

This is the same issue others have reported here. Just confirmed it still exists with the latest readinessprobe 1.0.8 and operator chart version 0.7.3. The readinessprobe executable fails with exit code 1. The binary that is copied from the mongodb-kubernetes-readinessprobe image is 38 MB and outputs absolutely nothing, so it's very hard to debug. I can't find the source for it anywhere.

If you replace the binary with an empty script (so that exit code=0), everything starts working!

Note: My cluster is in namespace "production"

kubectl exec -ti mongodb-0 -n production -c mongodb-agent -- bash -c 'mv /opt/scripts/readinessprobe /opt/scripts/readinessprobe_old'
kubectl exec -ti mongodb-0 -n production -c mongodb-agent -- bash -c 'echo "#!/usr/bin/env bash" > /opt/scripts/readinessprobe'
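
Note that a file created by shell redirection may not have the execute bit set, so the probe could still fail with a permission error; if that happens, marking the replacement script executable (same pod and namespace as above assumed) should cover it:

kubectl exec -ti mongodb-0 -n production -c mongodb-agent -- bash -c 'chmod +x /opt/scripts/readinessprobe'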

Removing the readiness probe is obviously not a solution for this. I'm guessing it's a bug in readinessprobe, but without any output or source code it's hard to say.

LoonyRules commented 2 years ago

Sadly, even hacking around the file's output didn't work for me. To begin with, accessing the pod shell indicates that the readinessprobe file simply does not exist. After adding your commands, the new file still does not exist, and I am unable to create any file in the directory myself either. Perhaps the underlying issue is permissions related? Not sure...

I've also updated the docker images to their latest tags, but to no avail. One thing I have noticed is that this is not a rare issue: some users of the bitnami mongodb helm chart have the same underlying issue as this operator chart. One place where I don't see this issue occurring is the enterprise operator, which you have to PAY for.

philip-nicholls commented 2 years ago

Are you sure you are in the correct container? I was pulling my hair out over this. A missing /opt/scripts folder was one of the symptoms I remember. If you don't specify the "mongodb-agent" container, it will take you to the "mongod" one which has no /opt/scripts.

If you have an /opt/scripts folder that is empty, it sounds like something really messed up with the init containers. In which case, your logs might help.
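
For example, something along these lines (adjust the pod name and namespace to yours) confirms whether the init container actually copied the script into the agent container:

kubectl exec -ti mongodb-0 -n <namespace> -c mongodb-agent -- ls -l /opt/scripts/
kubectl logs mongodb-0 -n <namespace> -c mongodb-agent-readinessprobe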

LoonyRules commented 2 years ago

Doing my weekly look back at this issue to see if it's resolved, and sadly it is not. It's kind of sad that it seems to have been an issue for a long time and isn't being treated as crucial to this operator's functionality...

Are you sure you are in the correct container? I was pulling my hair out over this. A missing /opt/scripts folder was one of the symptoms I remember. If you don't specify the "mongodb-agent" container, it will take you to the "mongod" one which has no /opt/scripts.

If you have an /opt/scripts folder that is empty, it sounds like something really messed up with the init containers. In which case, your logs might help.

Adding your commands to the commands the container executes on boot didn't result in a fix. Executing your commands directly results in an even worse error:

(combined from similar events): Readiness probe errored: rpc error: code =
  Unknown desc = failed to exec in container: failed to start exec
  "<some random uid>": OCI
  runtime exec failed: exec failed: container_linux.go:380: starting container
  process caused: exec: "/opt/scripts/readinessprobe": stat
  /opt/scripts/readinessprobe: no such file or directory: unknow
SaurabhSK123 commented 2 years ago

Same issue occurs for me. Has anyone solved this?

LoonyRules commented 2 years ago

Same issue occurs for me. Has anyone solved this?

As @philip-nicholls mentioned, you can try the scripts he made, but they didn't work for me, so I've quite literally had to disable readiness probes for mongo completely.

@priyolahiri I can see that you triaged this issue. Is there any ETA you can provide on its internal status? Leaving the community in the dark about such a crucial issue is only ever going to give MongoDB itself a BAD reputation.

SaurabhSK123 commented 2 years ago

How do you disable the readiness probe?

Belly-Buster commented 2 years ago

The best approach in the interim is to go back to v0.7.1, which seems reliable.

git clone --branch v0.7.1 --single-branch https://github.com/mongodb/mongodb-kubernetes-operator.git
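
From there, the usual install steps for that tag apply; roughly (paths are the standard ones from the repo's install docs, so double-check them against the v0.7.1 README):

cd mongodb-kubernetes-operator
kubectl apply -f config/crd/bases/mongodbcommunity.mongodb.com_mongodbcommunity.yaml
kubectl apply -k config/rbac/ --namespace mongodb
kubectl create -f config/manager/manager.yaml --namespace mongodb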

BryanDollery commented 2 years ago

What's happening with this? Is there a workaround? It's crippling me.

guitcastro commented 2 years ago

I ssh'd into the pod and ran the readiness script, measuring the execution time:

I have no name!@gmt-importer-mongodb-1:/$ time /opt/scripts/readinessprobe

real    0m10.070s
user    0m0.033s
sys     0m0.006s

It turns out it actually takes 10s to execute while the timeout was set to 1s. Then I found this in the documentation:

Under some circumstances it might be necessary to set your own custom values for the ReadinessProbe used by the MongoDB Community Operator. To do so, you should use the statefulSet attribute in resource.spec, as in the following provided example yaml file. Only those attributes passed will be set, for instance, given the following structure:

spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongodb-agent
              readinessProbe:
                failureThreshold: 40
                initialDelaySeconds: 5
                timeoutSeconds: 30

EDIT

Never mind, it worked for a while; now it's returning 1 and failing.

nuvme-devops commented 2 years ago

Did anyone manage to solve it?

cdivitotawela commented 2 years ago

Found the solution in issue https://github.com/mongodb/mongodb-kubernetes-operator/issues/651. At least in my situation, I was using the same scram credential secret name for multiple users. This configuration makes the operator keep regenerating the credentials, so it never completes reconciliation. After I started using different secrets, all was good and the readinessprobe started to work.

@nuvme-devops @BryanDollery @LoonyRules check whether you made the same mistake as I did. This might be the fix.
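
For reference, a minimal sketch of what the working configuration looks like in the MongoDBCommunity resource; the user and secret names below are illustrative, the point is that scramCredentialsSecretName must be unique per user:

apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: mongodb
spec:
  members: 2
  type: ReplicaSet
  version: 5.0.6
  security:
    authentication:
      modes: ["SCRAM"]
  users:
    - name: app-user
      db: admin
      passwordSecretRef:
        name: app-user-password
      scramCredentialsSecretName: app-user-scram     # unique per user
      roles:
        - name: readWrite
          db: appdb
    - name: admin-user
      db: admin
      passwordSecretRef:
        name: admin-user-password
      scramCredentialsSecretName: admin-user-scram   # must differ from the secret name above
      roles:
        - name: clusterAdmin
          db: admin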

nkreiger commented 2 years ago

@LoonyRules I ran into this issue on a GKE cluster...

  1. script didn't work for me
  2. I didn't have the same problem as @cdivitotawela

The only thing that worked for me was setting up a completely new installation (different name on the community CRD), so that it also set up a new PV, PVC, SVC, etc. instead of using anything existing on the node file system. I think there is some issue/conflict there that causes the MongoDB container to crash.

Rid commented 2 years ago

Just bumped into this issue when I started using the operator today.

The source for the readiness probe is here: https://github.com/mongodb/mongodb-kubernetes-operator/tree/master/cmd/readiness

I haven't had a chance to debug it yet, but will make a PR if I find the issue.

mossad-zika commented 2 years ago

@guitcastro is right about the source of this problem

I put only one user in the manifest and the issue disappeared.

github-actions[bot] commented 1 year ago

This issue is being marked stale because it has been open for 60 days with no activity. Please comment if this issue is still affecting you. If there is no change, this issue will be closed in 30 days.

moatorres commented 1 year ago

Exactly same issue here.

slaskawi commented 1 year ago

We have a bunch of tests in this area, which seem to be fine. Having said that, I believe this is something in your environment. Could you please try to debug this further?

Also, please check the output of the kubectl describe pod <a pod that has problems with the probe>. The readiness check output should be written somewhere in the events section.

mossad-zika commented 1 year ago

@slaskawi have you tried to configure more than 1 user via manifest?

moatorres commented 1 year ago

We have a bunch of tests in this area, which seem to be fine. Having said that, I believe this is something in your environment. Could you please try to debug this further?

Also, please check the output of the kubectl describe pod <a pod that has problems with the probe>. The readiness check output should be written somewhere in the events section.

Will do. I might be able to reproduce all steps, from cluster creation (GKE) up to the ReplicaSet deployment. Is there anything else specifically that you'd like me to debug? Happy to help.

slaskawi commented 1 year ago

@moatorres I actually noticed that I probably gave you the wrong guidance for checking the readiness output. Instead of doing kubectl describe, please check the logs in /var/log/mongodb-mms-automation/readiness.log and also post the /var/log/mongodb-mms-automation/agent-health-status.json file.
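
For example (adjust pod and namespace; per the StatefulSet above, in the mongodb-agent container the health status file is mounted under /var/log/mongodb-mms-automation/healthstatus/):

kubectl exec -ti <pod> -n <namespace> -c mongodb-agent -- tail -n 50 /var/log/mongodb-mms-automation/readiness.log
kubectl exec -ti <pod> -n <namespace> -c mongodb-agent -- cat /var/log/mongodb-mms-automation/healthstatus/agent-health-status.json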

ammurdoch commented 1 year ago

@slaskawi I am also having this problem, but only with one of the 3 instances of my MongoDB. /var/log/mongodb-mms-automation/readiness.log says

2022-12-17T22:44:42.449Z    INFO    build/main.go:71    Mongod is not ready
2022-12-17T22:44:52.439Z    INFO    build/main.go:71    Mongod is not ready
2022-12-17T22:45:02.433Z    INFO    build/main.go:71    Mongod is not ready

tail -f /healthstatus/agent-health-status.json says

{"statuses":{"miles-davis-mongo-db-1":{"IsInGoalState":false,"LastMongoUpTime":0,"ExpectedToBeUp":true,"ReplicationStatus":-1}},"mmsStatus":{"miles-davis-mongo-db-1":{"name":"miles-davis-mongo-db-1","lastGoalVersionAchieved":-1,"plans":[{"started":"2022-12-17T21:45:55.077231307Z","completed":null,"moves":[{"move":"Start","moveDoc":"Start the process","steps":[{"step":"StartFresh","stepDoc":"Start a mongo instance  (start fresh)","isWaitStep":false,"started":"2022-12-17T21:45:55.077256988Z","completed":null,"result":"error"}]},{"move":"WaitRsInit","moveDoc":"Wait for the replica set to be initialized by another member","steps":[{"step":"WaitRsInit","stepDoc":"Wait for the replica set to be initialized by another member","isWaitStep":true,"started":null,"completed":null,"result":""}]},{"move":"WaitFeatureCompatibilityVersionCorrect","moveDoc":"Wait for featureCompatibilityVersion to be right","steps":[{"step":"WaitFeatureCompatibilityVersionCorrect","stepDoc":"Wait for featureCompatibilityVersion to be right","isWaitStep":true,"started":null,"completed":null,"result":""}]}]}],"errorCode":0,"errorString":""}}}

Any help would be greatly appreciated.

slaskawi commented 1 year ago

@ammurdoch Looking at the logs, we can clearly see that Mongod didn't start successfully:

{"step":"StartFresh","stepDoc":"Start a mongo instance  (start fresh)","isWaitStep":false,"started":"2022-12-17T21:45:55.077256988Z","completed":null,"result":"error"}

Could you please look into the /var/log/mongodb-mms-automation directory and check whether there's anything suspicious there? There needs to be some trace of why Mongod didn't start (or crashed?).
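
For example, something along these lines (adjust pod and namespace) lists the log directory and shows the tail of the mongod log:

kubectl exec -ti <pod> -n <namespace> -c mongod -- ls -l /var/log/mongodb-mms-automation
kubectl exec -ti <pod> -n <namespace> -c mongod -- tail -n 100 /var/log/mongodb-mms-automation/mongodb.log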

ammurdoch commented 1 year ago

Hey @slaskawi, I really appreciate your reply. I was able to solve my problem. I was experiencing something related to "Slow Application of Oplog Entries". See https://www.mongodb.com/docs/manual/tutorial/troubleshoot-replica-sets/#slow-application-of-oplog-entries

Mongo wouldn't start up until it worked through half a year's worth of op_logs. I imagine it was hitting some error at the end of the op_logs and kept restarting every few hours. This particular node had restarted ~4000 times since November.

It's still unclear to me what exactly was going on, but the solution was to delete everything in that particular mongo instance's /data directory and let it refresh its data from the other two nodes instead of from op_logs.

Once I deleted everything in /data it restarted and refreshed its data without any more trouble.

slaskawi commented 1 year ago

I'm glad this helped you @ammurdoch

Just for the future: if the readiness probe is causing problems and you're certain the deployment itself is fine (like in this case), please use the statefulset override and disable the probe by making it always return success. Alternatively, tweak its settings to match your needs.
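
A minimal sketch of such an override on the MongoDBCommunity resource, assuming you really do want the probe to always succeed (swap the exec command as below, or simply raise timeoutSeconds/failureThreshold instead):

spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongodb-agent
              readinessProbe:
                exec:
                  command: ["/bin/true"]   # always succeeds; use only if you trust the replica set is healthy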

I'm resolving the ticket. Thanks for your patience!

wdonne commented 1 year ago

After fixing a privilege action error, the readinessprobe returned 0 instead of 1. Then I set readinessprobe.timeoutSeconds to 30, because it was only 1, which is very low; if you run the probe in a shell connected to the mongodb-agent container, it takes more than a second. Now it works.

I also use Kyverno and it seems that admission reports are being generated constantly.

steddyman commented 1 year ago

Found the solution in issue #651. At least in my situation, I was using the same scram credential secret name for multiple users. This configuration makes the operator keep regenerating the credentials, so it never completes reconciliation. After I started using different secrets, all was good and the readinessprobe started to work.

@nuvme-devops @BryanDollery @LoonyRules check whether you made the same mistake as I did. This might be the fix.

I can confirm this was also the reason for my problems with the pods restarting. I had 3 users defined in the replicaset yaml, all of which had the same name for the scramCredentialsSecretName property.

I've been trying to solve this for 2 days, and I had noticed the connection secrets were not being created, but assumed the documentation was not up to date.

Thanks for the info.

code-uri commented 1 year ago

Still facing the same issue. Any updates?

code-uri commented 1 year ago

@steddyman Thank you. I found the solution here: #651. I was using the same secret for two users; removing the extra user resolved the problem.

vipcxj commented 1 year ago

@guitcastro Same problem here. Did you find out why the health check is so slow?

pedrocwb commented 6 months ago

@cdivitotawela, my MongoDB Community replica set was in the pending state, and the MongoDB pods were failing to initialize the agent. The fix you suggested nailed it! I had also configured two users with the same SCRAM secret. Changing the secret name solved the issue. Now, the replica set is running, and the pods are healthy. Thank you!