machine-learning-exchange / mlx

Machine Learning eXchange (MLX). Data and AI Assets Catalog and Execution Engine
https://ml-exchange.org/
Apache License 2.0

`mlx-ui` pod fails to start up on OpenShift #343

Closed ckadner closed 2 years ago

ckadner commented 2 years ago

Describe the bug

MLX was deployed on OpenShift (4.8 and 4.10, on either IBM Cloud or Fyre) with the following commands:

# export MLX_DEPLOYMENT_TYPE=mlx-single-ibmcloud-openshift
export MLX_DEPLOYMENT_TYPE=mlx-single-fyre-openshift

git clone https://github.com/IBM/manifests -b v1.5-branch && cd manifests

while ! kustomize build ${MLX_DEPLOYMENT_TYPE} | \
  kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

The mlx-ui pod fails to start up:

NAME                                               READY   STATUS              RESTARTS   AGE
cache-deployer-deployment-798dc7d98b-9c4sj         1/1     Running             0          83s
cache-server-86f59c8696-d499g                      0/1     ContainerCreating   0          83s
kfp-csi-s3-4srhx                                   0/2     ContainerCreating   0          81s
kfp-csi-s3-9bqft                                   0/2     ContainerCreating   0          81s
kfp-csi-s3-gklqr                                   0/2     ContainerCreating   0          81s
metadata-envoy-deployment-5b4856dd5-m6t4m          1/1     Running             0          83s
metadata-grpc-deployment-6b5685488-gnszx           1/1     Running             0          83s
metadata-writer-9f698fdcb-x47pd                    1/1     Running             0          83s
minio-5b65df66c9-d257k                             1/1     Running             0          83s
ml-pipeline-77b7b79565-p2wfq                       1/1     Running             0          83s
ml-pipeline-persistenceagent-684f664fb7-q255d      1/1     Running             0          83s
ml-pipeline-scheduledworkflow-5dfcf96788-6mp2n     1/1     Running             0          82s
ml-pipeline-ui-6dfcc5c664-pkgbr                    1/1     Running             0          82s
ml-pipeline-viewer-crd-5878c6454f-mk92c            1/1     Running             0          82s
ml-pipeline-visualizationserver-6876996cdd-s4qvd   1/1     Running             0          82s
mlx-api-7f46b6df4f-xdvzw                           1/1     Running             0          82s
mlx-ui-7fbbbf6cbb-hll4z                            0/1     Error               3          82s
mysql-f7b9b7dd4-75l2q                              1/1     Running             0          82s

The output of `oc describe pod mlx-ui-7fbbbf6cbb-hll4z` shows exit code 243:

Containers:
  mlx-ui:
    Container ID:   cri-o://5d9d1caa2f3544a78c8b0e2cdc9cba9fc495a7c108ee3443220b417ca8c55d4b
    Image:          mlexchange/mlx-ui:nightly-origin-main
    Image ID:       docker.io/mlexchange/mlx-ui@sha256:70aa61ce62caeeeeaa549420c4684b5e0edb3dc96a8151b11f15939c5fe14152
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    243
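
To compare the exit code across restarts or clusters, it can help to pull just that field out of the describe output. This is a minimal sketch that assumes the `oc describe pod` layout shown above. The helper name `last_exit_code` is made up for illustration and is not part of MLX or `oc`.

```shell
#!/bin/sh
# Hypothetical helper: read `oc describe pod` output on stdin and print the
# first "Exit Code:" value found. For a pod with several containers this is
# only the first container that reports one.
last_exit_code() {
  awk -F': *' '/^ *Exit Code:/ { print $2; exit }'
}

# Usage (pod name taken from the report above):
#   oc describe pod mlx-ui-7fbbbf6cbb-hll4z | last_exit_code
```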

After deleting the failing mlx-ui pod, the replacement pod created by the deployment comes up fine:

$ oc get pods | grep mlx-ui
mlx-ui-7fbbbf6cbb-hll4z                            0/1     CrashLoopBackOff    7          13m

$ oc delete pod mlx-ui-7fbbbf6cbb-hll4z
pod "mlx-ui-7fbbbf6cbb-hll4z" deleted

$ oc get pods | grep mlx-ui
mlx-ui-7fbbbf6cbb-r5kxh                            1/1     Running             0          16s
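
Until the root cause is found, the manual workaround above could be scripted. The sketch below assumes the `oc get pods` column layout shown in this report and only selects mlx-ui pods whose status is `Error` or `CrashLoopBackOff`. The function name `failing_mlx_ui_pods` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get pods` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and print the names of
# mlx-ui pods that are in Error or CrashLoopBackOff.
failing_mlx_ui_pods() {
  awk '$1 ~ /^mlx-ui-/ && ($3 == "Error" || $3 == "CrashLoopBackOff") { print $1 }'
}

# Usage (assumes the current project is the one MLX was deployed to):
#   oc get pods --no-headers | failing_mlx_ui_pods | while read -r pod; do
#     oc delete pod "$pod"
#   done
```

Pods that are still in `ContainerCreating` are deliberately left alone, so the loop does not interrupt a pod that is starting normally.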

Thanks @jbusche for verifying that this error is consistent across various OpenShift deployments.