kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.53k stars 1.59k forks

[backend] ml-pipeline deployment readiness probe failed #7271

Closed yuhuishi-convect closed 4 months ago

yuhuishi-convect commented 2 years ago

Environment

Steps to reproduce

Expected result

Materials and Reference

The liveness probe of the `ml-pipeline` deployment failed.

```
$ k describe -n kubeflow pod ml-pipeline-5f465d4c56-7xcs8
Name:         ml-pipeline-5f465d4c56-7xcs8
Namespace:    kubeflow
Priority:     0
Node:         ip-10-0-3-78.us-west-2.compute.internal/10.0.3.78
Start Time:   Mon, 07 Feb 2022 11:05:22 -0800
Labels:       app=ml-pipeline
              application-crd-id=kubeflow-pipelines
              pod-template-hash=5f465d4c56
Annotations:  kubectl.kubernetes.io/restartedAt: 2022-02-06T17:31:44-08:00
              kubernetes.io/psp: eks.privileged
              sidecar.istio.io/inject: false
Status:       Running
IP:           10.0.3.52
IPs:
  IP:           10.0.3.52
Controlled By:  ReplicaSet/ml-pipeline-5f465d4c56
Containers:
  ml-pipeline-api-server:
    Container ID:   docker://6659ead43604634288ebe7987ba5f41e892e06c568645b2883547b3c26cdb167
    Image:          gcr.io/ml-pipeline/api-server:1.2.0
    Image ID:       docker-pullable://gcr.io/ml-pipeline/api-server@sha256:6553e9855e6d38eb5a70beeea39a2c37ac85b60f26a5c061b5e5e2adfffd960b
    Ports:          8888/TCP, 8887/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Mon, 07 Feb 2022 11:05:23 -0800
    Ready:          False
    Restart Count:  0
    Liveness:       exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:      exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      AUTO_UPDATE_PIPELINE_DEFAULT_VERSION:  Optional: false
      POD_NAMESPACE:                         kubeflow (v1:metadata.namespace)
      OBJECTSTORECONFIG_SECURE:              false
      OBJECTSTORECONFIG_BUCKETNAME:          Optional: false
      DBCONFIG_USER:                         Optional: false
      DBCONFIG_PASSWORD:                     Optional: false
      DBCONFIG_DBNAME:                       Optional: false
      DBCONFIG_HOST:                         Optional: false
      DBCONFIG_PORT:                         Optional: false
      OBJECTSTORECONFIG_ACCESSKEY:           Optional: false
      OBJECTSTORECONFIG_SECRETACCESSKEY:     Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from ml-pipeline-token-zvqgd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  ml-pipeline-token-zvqgd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ml-pipeline-token-zvqgd
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  12m                 default-scheduler  Successfully assigned kubeflow/ml-pipeline-5f465d4c56-7xcs8 to ip-10-0-3-78.us-west-2.compute.internal
  Normal   Pulled     12m                 kubelet            Container image "gcr.io/ml-pipeline/api-server:1.2.0" already present on machine
  Normal   Created    12m                 kubelet            Created container ml-pipeline-api-server
  Normal   Started    12m                 kubelet            Started container ml-pipeline-api-server
  Warning  Unhealthy  6s (x2 over 6m12s)  kubelet            Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  Unhealthy  4s (x2 over 6m10s)  kubelet            Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

Logs of the pod

$ k logs -n kubeflow ml-pipeline-5f465d4c56-7xcs8                   
I0207 19:05:23.824447       9 client_manager.go:140] Initializing client manager
I0207 19:05:23.824841       9 config.go:56] Config DBConfig.ExtraParams not specified, skipping

Executing the health check from the pod receives no response

k exec -n kubeflow ml-pipeline-5f465d4c56-7xcs8 -- wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz
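
A `wget` that hangs makes it hard to tell "slow" from "never answers". As a minimal sketch (not part of KFP, and assuming Python is available wherever the check runs, for example on a workstation behind a port-forward), the same request can be made with an explicit timeout. The function name `check_health` and the 2-second default, chosen to mirror the probe's `timeout=2s`, are illustrative assumptions.

```python
import urllib.error
import urllib.request
from typing import Tuple


def check_health(url: str, timeout_s: float = 2.0) -> Tuple[bool, str]:
    """Issue one HTTP GET and report (healthy, detail).

    Hypothetical helper: it mirrors the probe's wget call but always
    returns within roughly timeout_s instead of hanging.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers refused connections, DNS failures, and timeouts.
        return False, f"no response: {exc}"
```

Calling `check_health("http://localhost:8888/apis/v1beta1/healthz")` against the endpoint from this issue would distinguish an HTTP error from a server that never responds.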

Deployment yaml of the `ml-pipeline`

```
$ k get deploy -n kubeflow ml-pipeline -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"ml-pipeline","application-crd-id":"kubeflow-pipelines"},"name":"ml-pipeline","namespace":"kubeflow"},"spec":{"selector":{"matchLabels":{"app":"ml-pipeline","application-crd-id":"kubeflow-pipelines"}},"template":{"metadata":{"labels":{"app":"ml-pipeline","application-crd-id":"kubeflow-pipelines"}},"spec":{"containers":[{"env":[{"name":"AUTO_UPDATE_PIPELINE_DEFAULT_VERSION","valueFrom":{"configMapKeyRef":{"key":"autoUpdatePipelineDefaultVersion","name":"pipeline-install-config-d42hc87dh2"}}},{"name":"POD_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}},{"name":"OBJECTSTORECONFIG_SECURE","value":"false"},{"name":"OBJECTSTORECONFIG_BUCKETNAME","valueFrom":{"configMapKeyRef":{"key":"bucketName","name":"pipeline-install-config-d42hc87dh2"}}},{"name":"DBCONFIG_USER","valueFrom":{"secretKeyRef":{"key":"username","name":"mysql-secret-fd5gktm75t"}}},{"name":"DBCONFIG_PASSWORD","valueFrom":{"secretKeyRef":{"key":"password","name":"mysql-secret-fd5gktm75t"}}},{"name":"DBCONFIG_DBNAME","valueFrom":{"configMapKeyRef":{"key":"pipelineDb","name":"pipeline-install-config-d42hc87dh2"}}},{"name":"DBCONFIG_HOST","valueFrom":{"configMapKeyRef":{"key":"dbHost","name":"pipeline-install-config-d42hc87dh2"}}},{"name":"DBCONFIG_PORT","valueFrom":{"configMapKeyRef":{"key":"dbPort","name":"pipeline-install-config-d42hc87dh2"}}},{"name":"OBJECTSTORECONFIG_ACCESSKEY","valueFrom":{"secretKeyRef":{"key":"accesskey","name":"mlpipeline-minio-artifact"}}},{"name":"OBJECTSTORECONFIG_SECRETACCESSKEY","valueFrom":{"secretKeyRef":{"key":"secretkey","name":"mlpipeline-minio-artifact"}}}],"image":"gcr.io/ml-pipeline/api-server:1.2.0","imagePullPolicy":"IfNotPresent","livenessProbe":{"exec":{"command":["wget","-q","-S","-O","-","http://localhost:8888/apis/v1beta1/healthz"]},"initialDelaySeconds":3,"periodSeconds":5,"timeoutSeconds":2},"name":"ml-pipeline-api-server","ports":[{"containerPort":8888,"name":"http"},{"containerPort":8887,"name":"grpc"}],"readinessProbe":{"exec":{"command":["wget","-q","-S","-O","-","http://localhost:8888/apis/v1beta1/healthz"]},"initialDelaySeconds":3,"periodSeconds":5,"timeoutSeconds":2}}],"serviceAccountName":"ml-pipeline"}}}}
  creationTimestamp: "2021-01-15T22:01:56Z"
  generation: 15
  labels:
    app: ml-pipeline
    application-crd-id: kubeflow-pipelines
  name: ml-pipeline
  namespace: kubeflow
  ownerReferences:
  - apiVersion: app.k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: false
    kind: Application
    name: pipeline
    uid: ea8a9b37-0c16-439e-bc49-3399051aca6e
  resourceVersion: "532602378"
  selfLink: /apis/apps/v1/namespaces/kubeflow/deployments/ml-pipeline
  uid: 908e252d-c7c6-49f2-88e0-dcf568097b14
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: ml-pipeline
      application-crd-id: kubeflow-pipelines
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2022-02-06T17:31:44-08:00"
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        app: ml-pipeline
        application-crd-id: kubeflow-pipelines
    spec:
      containers:
      - env:
        - name: AUTO_UPDATE_PIPELINE_DEFAULT_VERSION
          valueFrom:
            configMapKeyRef:
              key: autoUpdatePipelineDefaultVersion
              name: pipeline-install-config-d42hc87dh2
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: OBJECTSTORECONFIG_SECURE
          value: "false"
        - name: OBJECTSTORECONFIG_BUCKETNAME
          valueFrom:
            configMapKeyRef:
              key: bucketName
              name: pipeline-install-config-d42hc87dh2
        - name: DBCONFIG_USER
          valueFrom:
            secretKeyRef:
              key: username
              name: mysql-secret-fd5gktm75t
        - name: DBCONFIG_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: mysql-secret-fd5gktm75t
        - name: DBCONFIG_DBNAME
          valueFrom:
            configMapKeyRef:
              key: pipelineDb
              name: pipeline-install-config-d42hc87dh2
        - name: DBCONFIG_HOST
          valueFrom:
            configMapKeyRef:
              key: dbHost
              name: pipeline-install-config-d42hc87dh2
        - name: DBCONFIG_PORT
          valueFrom:
            configMapKeyRef:
              key: dbPort
              name: pipeline-install-config-d42hc87dh2
        - name: OBJECTSTORECONFIG_ACCESSKEY
          valueFrom:
            secretKeyRef:
              key: accesskey
              name: mlpipeline-minio-artifact
        - name: OBJECTSTORECONFIG_SECRETACCESSKEY
          valueFrom:
            secretKeyRef:
              key: secretkey
              name: mlpipeline-minio-artifact
        image: gcr.io/ml-pipeline/api-server:1.2.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - wget
            - -q
            - -S
            - -O
            - '-'
            - http://localhost:8888/apis/v1beta1/healthz
          failureThreshold: 3
          initialDelaySeconds: 3
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
        name: ml-pipeline-api-server
        ports:
        - containerPort: 8888
          name: http
          protocol: TCP
        - containerPort: 8887
          name: grpc
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - wget
            - -q
            - -S
            - -O
            - '-'
            - http://localhost:8888/apis/v1beta1/healthz
          failureThreshold: 3
          initialDelaySeconds: 3
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ml-pipeline
      serviceAccountName: ml-pipeline
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastTransitionTime: "2022-02-07T18:58:32Z"
    lastUpdateTime: "2022-02-07T18:58:32Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2022-02-07T19:16:46Z"
    lastUpdateTime: "2022-02-07T19:16:46Z"
    message: ReplicaSet "ml-pipeline-5f465d4c56" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 15
  replicas: 2
  unavailableReplicas: 2
  updatedReplicas: 1
```
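
The probe settings in this manifest are tight: the first check fires 3 seconds after the container starts. A rough back-of-the-envelope sketch of how quickly `failureThreshold` consecutive failures can accumulate is given below. It relies only on the documented meaning of the four probe fields, ignores kubelet jitter, and says nothing about how the kubelet counts probes that error out (as in the events above) rather than fail. The `Probe` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Subset of Kubernetes probe fields used by this estimate."""
    initial_delay_s: int
    period_s: int
    timeout_s: int
    failure_threshold: int


def seconds_until_threshold(p: Probe) -> int:
    """Approximate time from container start until the Nth consecutive
    probe has timed out, where N is failure_threshold.

    Probe k (1-based) starts at initial_delay + (k - 1) * period and can
    take up to timeout seconds to be declared failed.
    """
    return p.initial_delay_s + (p.failure_threshold - 1) * p.period_s + p.timeout_s


# Values copied from the livenessProbe/readinessProbe in the manifest above.
probe = Probe(initial_delay_s=3, period_s=5, timeout_s=2, failure_threshold=3)
print(seconds_until_threshold(probe))  # 3 + 2 * 5 + 2 = 15
```

Under these assumptions, a server that needs longer than about 15 seconds to start answering would already have exhausted its failure budget.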

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

zijianjoy commented 2 years ago

Hello @yuhuishi-convect, can you provide more information about the other deployments in your cluster?

ml-pipeline is the last Deployment to become ready: it can be ready only when the other Deployments are running. A possible reason is that a storage backend it depends on (the SQL database, etc.) is failing, which causes ml-pipeline to fail as well. Can you share more information about the health of the other Deployments in your cluster?
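
One way to test the storage hypothesis is a plain TCP connection attempt to the database endpoint. The sketch below is an illustration, not a KFP tool: it reads the host and port from the `DBCONFIG_HOST` and `DBCONFIG_PORT` variables that appear in the deployment, whose actual values are not shown in this issue. The function names are hypothetical, and a successful connection only shows that the port is reachable, not that MySQL accepts the credentials.

```python
import socket
from typing import Mapping, Tuple


def db_endpoint_from_env(env: Mapping[str, str]) -> Tuple[str, int]:
    """Read the DB host and port from DBCONFIG_HOST / DBCONFIG_PORT.

    Raises KeyError if either variable is missing.
    """
    return env["DBCONFIG_HOST"], int(env["DBCONFIG_PORT"])


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, DNS failure, or timed out.
        return False
```

Run from an environment that has those variables set, `can_connect(*db_endpoint_from_env(os.environ))` (after `import os`) would report whether the database port is reachable at all.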

zijianjoy commented 2 years ago

May I ask which Kubernetes version you are deploying to? Similar post: https://github.com/kubernetes/kubernetes/issues/106111

yuhuishi-convect commented 2 years ago

> Hello @yuhuishi-convect, can you provide more information about the other deployments in your cluster?
>
> ml-pipeline is the last Deployment to become ready: it can be ready only when the other Deployments are running. A possible reason is that a storage backend it depends on (the SQL database, etc.) is failing, which causes ml-pipeline to fail as well. Can you share more information about the health of the other Deployments in your cluster?

$ k get pods -n kubeflow-helm 
NAME                                               READY   STATUS    RESTARTS   AGE
cache-deployer-deployment-bb8d6cb65-9hqfb          1/1     Running   0          10m
cache-server-7fffdd889d-zgnc9                      1/1     Running   0          10m
metadata-envoy-7cd8b6db48-nw6w8                    1/1     Running   0          10m
metadata-grpc-deployment-69995cb9dc-lq9c8          1/1     Running   1          10m
metadata-writer-5986bfb78-v7dwr                    1/1     Running   0          10m
minio-5cd667bc76-2965c                             1/1     Running   0          10m
ml-pipeline-5ffbcfcd95-wjhvn                       0/1     Running   5          4m12s
ml-pipeline-persistenceagent-84fdcf9cbc-pq2nv      1/1     Running   4          10m
ml-pipeline-scheduledworkflow-59d66b54c6-qc957     1/1     Running   0          10m
ml-pipeline-ui-58d56bd7cc-mvzcl                    1/1     Running   0          10m
ml-pipeline-viewer-crd-856f5454d8-hkk65            1/1     Running   0          10m
ml-pipeline-visualizationserver-5486886667-c62pr   1/1     Running   0          10m
mysql-85445f56b7-b7fp5                             1/1     Running   0          11m
workflow-controller-7f469d8fcd-c6fzn               1/1     Running   0          10m
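
In a longer listing, the pods that are not fully ready can be picked out mechanically. The following is a minimal sketch that assumes the default `kubectl get pods` column layout shown above (NAME first, READY second, in `ready/total` form). The function name `not_ready_pods` is hypothetical.

```python
from typing import List


def not_ready_pods(table: str) -> List[str]:
    """Return names of pods whose READY column is not n/n.

    Expects the plain-text output of `kubectl get pods`, header included.
    """
    rows = [line for line in table.strip().splitlines() if line.strip()]
    names = []
    for row in rows[1:]:  # skip the header row
        fields = row.split()
        name, ready = fields[0], fields[1]
        ready_count, total = (int(part) for part in ready.split("/"))
        if ready_count != total:
            names.append(name)
    return names
```

Applied to the listing above, only `ml-pipeline-5ffbcfcd95-wjhvn` (READY `0/1`) is returned; every other pod, including `mysql` and `minio`, reports `1/1`.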

yuhuishi-convect commented 2 years ago

> May I ask which Kubernetes version you are deploying to? Similar post: kubernetes/kubernetes#106111

$ k version                   
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

zijianjoy commented 2 years ago

@yuhuishi-convect

KFP backend 1.2 is a very old version and might not work on Kubernetes 1.21. Can you try deploying KFP backend v1.8.1 instead? https://github.com/kubeflow/pipelines/releases/tag/1.8.1

rimolive commented 4 months ago

Closing this issue, KFP 2.0.5 is available. Feel free to reopen it if the issue persists in the latest version.

/close

google-oss-prow[bot] commented 4 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/7271#issuecomment-1991037992):

>Closing this issue, KFP 2.0.5 is available. Feel free to reopen it if the issue persists in the latest version.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.