kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Cannot list artifacts #8189

Closed pablofiumara closed 8 months ago

pablofiumara commented 2 years ago

Environment

Using https://www.kubeflow.org/docs/distributions/gke/deploy/upgrade/

Steps to reproduce

Upgrading from Kubeflow 1.3 to Kubeflow 1.5 reproduces the problem.

Expected result

I expect to see a list of artifacts when I access myClusterURL/pipeline/artifacts. Instead I get this error: https://user-images.githubusercontent.com/74205824/186285977-cba538c2-e496-416e-8f27-67fa4950b4cc.png

Materials and Reference


Impacted by this bug? Give it a 👍.

zijianjoy commented 2 years ago

Have you checked that both the ml-pipeline-ui deployment in the kubeflow namespace and the ml-pipeline-ui-artifact deployment in the user namespaces are using gcr.io/ml-pipeline/frontend:1.8.1?
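
For example, something like the following should print the image each deployment is running (assuming the default deployment names; replace <user-namespace> with your profile namespace):

kubectl -n kubeflow get deploy ml-pipeline-ui -o jsonpath='{.spec.template.spec.containers[0].image}'
kubectl -n <user-namespace> get deploy ml-pipeline-ui-artifact -o jsonpath='{.spec.template.spec.containers[0].image}'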

pablofiumara commented 2 years ago

@zijianjoy Yes, I have

Name:                   ml-pipeline-ui
Namespace:              kubeflow
CreationTimestamp:      Wed, 23 Jun 2021 21:52:54 -0300
Labels:                 app=ml-pipeline-ui
                        app.kubernetes.io/component=ml-pipeline
                        app.kubernetes.io/name=kubeflow-pipelines
Annotations:            deployment.kubernetes.io/revision: 21
Selector:               app=ml-pipeline-ui,app.kubernetes.io/component=ml-pipeline,app.kubernetes.io/name=kubeflow-pipelines
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=ml-pipeline-ui
                    app.kubernetes.io/component=ml-pipeline
                    app.kubernetes.io/name=kubeflow-pipelines
  Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: true
                    kubectl.kubernetes.io/restartedAt: 2022-08-25T18:19:01-03:00
  Service Account:  ml-pipeline-ui
  Containers:
   ml-pipeline-ui:
    Image:      gcr.io/ml-pipeline/frontend:1.8.1
    Port:       3000/TCP
    Host Port:  0/TCP
    Requests:
      cpu:      10m
      memory:   70Mi
    Liveness:   exec [wget -q -S -O - http://localhost:3000/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  exec [wget -q -S -O - http://localhost:3000/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      KUBEFLOW_USERID_HEADER:                     <set to the key 'userid-header' of config map 'kubeflow-config'>  Optional: false
      KUBEFLOW_USERID_PREFIX:                     <set to the key 'userid-prefix' of config map 'kubeflow-config'>  Optional: false
      VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH:  /etc/config/viewer-pod-template.json
      DEPLOYMENT:                                 KUBEFLOW
      ARTIFACTS_SERVICE_PROXY_NAME:               ml-pipeline-ui-artifact
      ARTIFACTS_SERVICE_PROXY_PORT:               80
      ARTIFACTS_SERVICE_PROXY_ENABLED:            true
      ENABLE_AUTHZ:                               true
      MINIO_NAMESPACE:                             (v1:metadata.namespace)
      MINIO_ACCESS_KEY:                           <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'>  Optional: false
      MINIO_SECRET_KEY:                           <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'>  Optional: false
      ALLOW_CUSTOM_VISUALIZATIONS:                true
    Mounts:
      /etc/config from config-volume (ro)
  Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ml-pipeline-ui-configmap
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   ml-pipeline-ui-oneId (1/1 replicas created)
Events:          <none>

Name:                   ml-pipeline-ui-artifact
Namespace:              myNamespace
CreationTimestamp:      Mon, 13 Jun 2022 17:20:27 -0300
Labels:                 app=ml-pipeline-ui-artifact
                        controller-uid=34641e66-4d49-4025-b235-fc433a8e2049
Annotations:            deployment.kubernetes.io/revision: 4
                        metacontroller.k8s.io/last-applied-configuration:
                          {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"labels":{"app":"ml-pipeline-ui-artifact","controller-uid":"34641e66-4d49-4025-b23...
Selector:               app=ml-pipeline-ui-artifact
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=ml-pipeline-ui-artifact
  Annotations:      kubectl.kubernetes.io/restartedAt: 2022-08-23T18:23:11-03:00
  Service Account:  default-editor
  Containers:
   ml-pipeline-ui-artifact:
    Image:      gcr.io/ml-pipeline/frontend:1.8.1
    Port:       3000/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:     10m
      memory:  70Mi
    Environment:
      MINIO_ACCESS_KEY:  <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'>  Optional: false
      MINIO_SECRET_KEY:  <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'>  Optional: false
    Mounts:              <none>
  Volumes:               <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   ml-pipeline-ui-artifact-bb5bc4b57 (1/1 replicas created)
Events:          <none>

What else can I check?

pablofiumara commented 2 years ago

If I go to myCluster/ml_metadata.MetadataStoreService/GetEventsByArtifactIDs, I get the message

upstream connect error or disconnect/reset before headers. reset reason: remote reset

Using asm-1143-0
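
For what it's worth, the MLMD server logs can be checked with something like the following (this assumes the default deployment name metadata-grpc-deployment):

kubectl -n kubeflow logs deploy/metadata-grpc-deployment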

zijianjoy commented 2 years ago

ml-metadata was upgraded from 1.0.0 to 1.5.0 when Kubeflow was upgraded from 1.3 to 1.5: https://github.com/kubeflow/pipelines/commits/master/third_party/ml-metadata

As a result, the MLMD schema version has changed, so you need to follow the instructions to upgrade the MLMD dependency: https://github.com/google/ml-metadata/blob/master/g3doc/get_started.md#upgrade-the-mlmd-library
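
A quick way to confirm which MLMD server version is actually deployed (again assuming the default deployment name):

kubectl -n kubeflow get deploy metadata-grpc-deployment -o jsonpath='{.spec.template.spec.containers[0].image}'

For Kubeflow 1.5 this should report something like gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0.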

pablofiumara commented 2 years ago

@zijianjoy Thank you very much for your answer. If I execute

kubectl describe deployment metadata-grpc-deployment -n kubeflow

I get


Name:                   metadata-grpc-deployment
Namespace:              kubeflow
CreationTimestamp:      Wed, 23 Jun 2021 21:52:53 -0300
Labels:                 component=metadata-grpc-server
Annotations:            deployment.kubernetes.io/revision: 27
Selector:               component=metadata-grpc-server
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           component=metadata-grpc-server
  Annotations:      kubectl.kubernetes.io/restartedAt: 2022-08-26T16:44:45-03:00
  Service Account:  metadata-grpc-server
  Containers:
   container:
    Image:      gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /bin/metadata_store_server
    Args:
      --grpc_port=8080
      --mysql_config_database=$(MYSQL_DATABASE)
      --mysql_config_host=$(MYSQL_HOST)
      --mysql_config_port=$(MYSQL_PORT)
      --mysql_config_user=$(DBCONFIG_USER)
      --mysql_config_password=$(DBCONFIG_PASSWORD)
      --enable_database_upgrade=true
    Liveness:   tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      DBCONFIG_USER:      <set to the key 'username' in secret 'mysql-secret'>               Optional: false
      DBCONFIG_PASSWORD:  <set to the key 'password' in secret 'mysql-secret'>               Optional: false
      MYSQL_DATABASE:     <set to the key 'mlmdDb' of config map 'pipeline-install-config'>  Optional: false
      MYSQL_HOST:         <set to the key 'dbHost' of config map 'pipeline-install-config'>  Optional: false
      MYSQL_PORT:         <set to the key 'dbPort' of config map 'pipeline-install-config'>  Optional: false
    Mounts:               <none>
  Volumes:                <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   metadata-grpc-deployment-56779cf65 (1/1 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  50m    deployment-controller  Scaled up replica set metadata-grpc-deployment-bb6856f48 to 1
  Normal  ScalingReplicaSet  48m    deployment-controller  Scaled down replica set metadata-grpc-deployment-58c7dbcd8b to 0
  Normal  ScalingReplicaSet  39m    deployment-controller  Scaled up replica set metadata-grpc-deployment-6cc4b76c8d to 1
  Normal  ScalingReplicaSet  38m    deployment-controller  Scaled down replica set metadata-grpc-deployment-bb6856f48 to 0
  Normal  ScalingReplicaSet  36m    deployment-controller  Scaled up replica set metadata-grpc-deployment-8c74d44b5 to 1
  Normal  ScalingReplicaSet  35m    deployment-controller  Scaled down replica set metadata-grpc-deployment-6cc4b76c8d to 0
  Normal  ScalingReplicaSet  2m53s  deployment-controller  Scaled up replica set metadata-grpc-deployment-56779cf65 to 1
  Normal  ScalingReplicaSet  2m19s  deployment-controller  Scaled down replica set metadata-grpc-deployment-8c74d44b5 to 0

Does this mean the MLMD dependency version is correct? What am I missing?

zijianjoy commented 2 years ago

You need to upgrade the MLMD database schema: https://github.com/google/ml-metadata/blob/master/g3doc/get_started.md#upgrade-the-database-schema
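
Concretely, the metadata store server has to be started once with the --enable_database_upgrade=true flag so it migrates the schema (the linked guide describes this). A minimal sketch of adding the flag to the deployment, assuming the container is at index 0 and the args list does not already contain it:

kubectl -n kubeflow patch deployment metadata-grpc-deployment --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--enable_database_upgrade=true"}]'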

zijianjoy commented 2 years ago

There is a tool for MLMD upgrade: https://github.com/kubeflow/pipelines/blob/74c7773ca40decfd0d4ed40dc93a6af591bbc190/tools/metadatastore-upgrade/README.md

celiawa commented 2 years ago

Hi @zijianjoy, our cluster is a freshly installed Kubeflow 1.5.0 cluster.

We also see the error page below when accessing myClusterURL/pipeline/artifacts.

In the beginning, the artifacts page loaded successfully, but after we ran about 600 recurring runs it failed to load with the above message.

Even after we removed all the content under the mlpipeline/artifacts/ path in MinIO, the artifacts page still fails to load with the same error.

Is there any way to recover? Thanks!

zijianjoy commented 2 years ago

@celiawa Currently the page lists all artifacts from the MLMD store. Even if you delete the content in MinIO, the corresponding MLMD objects are not deleted from the store, so it is likely timing out while trying to list all the artifacts. There is a plan to improve this page: https://github.com/kubeflow/pipelines/issues/3226
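
A rough way to see how many artifacts MLMD is tracking is to count rows in the Artifact table (this assumes the default MySQL deployment name mysql, the default MLMD database metadb, and that the root credentials allow it; adjust as needed):

kubectl -n kubeflow exec deploy/mysql -- mysql -u root -e 'SELECT COUNT(*) FROM metadb.Artifact;'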

celiawa commented 2 years ago

Thanks @zijianjoy. I checked the MySQL database backing the MLMD store, and there are many tables in it. Which tables should we delete to get our artifacts page back? We don't want to reinstall.

subasathees commented 1 year ago

Hi @zijianjoy @celiawa, I am also facing the same issue and am unable to see the artifacts in Kubeflow. Please let me know how to fix it.

zijianjoy commented 1 year ago

Upgrading KFP to the latest version should allow you to see a paginated artifact list now.

celiawa commented 1 year ago

Thanks @zijianjoy, we upgraded to KFP 2.0.1 and can see the paginated artifact list now.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rimolive commented 8 months ago

Closing this issue as it seems the issue is solved.

/close

google-oss-prow[bot] commented 8 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/8189#issuecomment-1990978102):

> Closing this issue as it seems the issue is solved.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.