kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Steps not getting cached when input parameters are same #5764

Closed cloudbow closed 4 months ago

cloudbow commented 3 years ago

Environment

kfp 1.5.0
kfp-pipeline-spec 0.1.7
kfp-server-api 1.5.0

Steps to reproduce

I have attached the notebook I used; please try it with that. The input is already provided. It's a simple add_op pipeline that adds two numbers. Why is each step executed again on every run, whether the run is cloned or a new run is created from the same pipeline?

Expected result

The steps should have been cached, since the inputs, the Docker image, and the outputs are all the same.
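The expectation above can be illustrated with a generic sketch of content-addressed step caching. This is not KFP's actual key derivation; `cache_key`, `run_step`, the in-memory `_cache`, and the `python:3.7` image name are all hypothetical and exist only to show why a step with an identical image, command, and inputs would be expected to hit the cache.

```python
import hashlib
import json

# Hypothetical in-memory cache; a real cache service would persist this.
_cache = {}


def cache_key(image, command, inputs):
    """Fingerprint a step from everything that determines its output."""
    payload = json.dumps(
        {"image": image, "command": command, "inputs": inputs},
        sort_keys=True,  # stable ordering so equal dicts hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(image, command, inputs, fn):
    """Return (result, was_cached); run fn only on a cache miss."""
    key = cache_key(image, command, inputs)
    if key in _cache:
        return _cache[key], True
    result = fn(**inputs)
    _cache[key] = result
    return result, False


def add(a, b):
    return a + b


# Same image, command, and inputs: the second call is served from the cache.
first = run_step("python:3.7", ["add"], {"a": 1, "b": 2}, add)
second = run_step("python:3.7", ["add"], {"a": 1, "b": 2}, add)
# Changing any input yields a different key, so the step runs again.
third = run_step("python:3.7", ["add"], {"a": 1, "b": 3}, add)
```

Under this model a cloned run with unchanged parameters should never re-execute a step, which is the behaviour the report expects.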

Materials and Reference

Attached sample [code](simple_function_based_component_pipeline (1).ipynb.zip)



Bobgy commented 3 years ago

Hi @cloudbow, I tried running your example on KFP standalone 1.6.0 on GCP, and the steps are cached as expected on the second run.


Therefore, this issue might be specific to either your environment or MiniKF. Can you use `kubectl get pod -n kubeflow`, `kubectl describe pod <pod-name>`, `kubectl logs <pod-name>`, and similar techniques to check your deployment? The key servers to look at are cache-server and cache-deployer. Something might be failing with them.

Itega commented 3 years ago

Hi @Bobgy, I have the same issue in a different environment, so I can provide some information.

I tried KFP 1.6/1.5/1.4 standalone on GCP, and AI Platform deployed on an existing cluster, and never got caching. All deployments are fine, including cache-server and cache-deployer, but the cachedb is always empty (the other databases seem fine).

With AI Platform and a new cluster I do get caching, so the problem may come from the environment. I tried with private clusters with autoscaling (min 3 nodes, 2 vCPU and 4.5 GB RAM).

Edit: After further investigation, this seems to be related to the use of a private cluster in my case.

cloudbow commented 3 years ago

@Itega what do you mean by private cluster? I am running Kubeflow on MiniKF 1.3 from the marketplace. Can this also be called a private cluster? @Bobgy let me check.

cloudbow commented 3 years ago

ubuntu@ip-10-101-8-247:~$ kubectl get po -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-8c9cdf478-q2lmt 2/2 Running 0 7d19h
centraldashboard-77cb6bbb48-nktsx 2/2 Running 0 7d19h
jupyter-web-app-deployment-75795878-ts9t2 2/2 Running 0 7d19h
katib-controller-6d6bb5495d-zc29z 2/2 Running 0 7d19h
katib-db-manager-6ff648f5cc-r5mgc 2/2 Running 0 7d19h
katib-mysql-6495dccdd5-vpffx 2/2 Running 0 7d19h
katib-ui-7ddf4965f9-j49ss 2/2 Running 0 7d19h
kfp-cache-7fd4488b7f-t2kcn 3/3 Running 0 7d19h
kfserving-controller-manager-0 3/3 Running 0 7d19h
kubeflow-reception-7895dd4d69-lxlss 2/2 Running 0 7d19h
metadata-db-6bf8b57f97-jqg29 2/2 Running 0 7d19h
metadata-envoy-deployment-549d875989-r4kk8 1/1 Running 0 7d19h
metadata-grpc-deployment-ccc8c8bd9-rw2xz 2/2 Running 4 7d19h
minio-6cfd7cb4f-25zkp 2/2 Running 0 7d19h
ml-pipeline-5dc8fff45b-nj76p 2/2 Running 0 7d19h
ml-pipeline-persistenceagent-c6b4d475f-hmnwt 2/2 Running 0 7d19h
ml-pipeline-scheduledworkflow-64dc954c6c-tzp4x 2/2 Running 0 7d19h
ml-pipeline-ui-78846f6754-tmnth 2/2 Running 1 7d19h
ml-pipeline-viewer-crd-5ffbd79f68-dx667 2/2 Running 0 7d19h
ml-pipeline-visualizationserver-5977df9c45-6xq5x 2/2 Running 0 7d19h
models-web-app-7bfdc5c585-rznv7 2/2 Running 0 7d19h
mpi-operator-754d876fd8-gppnx 1/1 Running 1 7d19h
mxnet-operator-c5f7b6798-gzmxv 1/1 Running 1 7d19h
mysql-65ff8d5dfd-wqbbd 2/2 Running 0 7d19h
notebook-controller-deployment-7c46fdd957-f957p 2/2 Running 0 7d19h
profiles-deployment-588f5fdcf8-26xmv 3/3 Running 0 7d19h
pvcviewer-controller-controller-manager-5998dc798b-jx2hf 3/3 Running 1 7d19h
pytorch-operator-77b7ff46c-hhfhj 2/2 Running 1 7d19h
spark-operatorsparkoperator-579554d99d-mnkz2 2/2 Running 0 7d19h
tensorboard-controller-controller-manager-6d99664986-n624x 3/3 Running 1 7d19h
tensorboards-web-app-deployment-6b98985bc5-xv6rv 1/1 Running 0 7d19h
tf-job-operator-5bb7675fb8-4nfhq 2/2 Running 1 7d19h
volumes-web-app-deployment-b8d6cc797-xwdmz 2/2 Running 0 7d19h
workflow-controller-5f9dbb559c-dw2tk 2/2 Running 0 7d19h
xgboost-operator-deployment-7bf56c6d4f-cf7jc 2/2 Running 0 7d19h
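A listing this long is easy to misread, so here is a minimal sketch of one way to check it for the two components named earlier in the thread. The prefixes `cache-server` and `cache-deployer` are taken from that comment and are assumed to match the start of the pod names; `missing_components` is a hypothetical helper, and the sample is a short excerpt of the listing above.

```python
# Pod-name prefixes to look for. They come from the names used earlier in
# the thread and are an assumption about how the pods are actually named.
EXPECTED_PREFIXES = ("cache-server", "cache-deployer")


def missing_components(kubectl_output, prefixes=EXPECTED_PREFIXES):
    """Return the expected prefixes that no token in the output starts with.

    Splitting on whitespace means the check works whether the listing is
    one pod per line or flattened onto a single line.
    """
    tokens = kubectl_output.split()
    return [p for p in prefixes if not any(t.startswith(p) for t in tokens)]


# Excerpt of the `kubectl get po -n kubeflow` output shown above.
sample = """
NAME READY STATUS RESTARTS AGE
kfp-cache-7fd4488b7f-t2kcn 3/3 Running 0 7d19h
ml-pipeline-5dc8fff45b-nj76p 2/2 Running 0 7d19h
workflow-controller-5f9dbb559c-dw2tk 2/2 Running 0 7d19h
"""

print(missing_components(sample))  # ['cache-server', 'cache-deployer']
```

On this excerpt both prefixes are reported missing: the only cache-related pod is `kfp-cache`, which does not match either name.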

cloudbow commented 3 years ago

I see only kfp-cache. I turned on logging and ran the pipeline, and here is what I got.

First run

{"level":"info","ts":1622631681.9666345,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631681.9666553,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631681.9760761,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631681.9761019,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631682.023542,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631682.0235684,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631684.0169182,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631684.0169406,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631685.1026225,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631685.1026473,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631685.4869857,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631685.487011,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631686.1834323,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631686.1834562,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631692.0271308,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631692.0308237,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631692.0488806,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631692.048913,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631692.065883,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631692.0659041,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}
{"level":"info","ts":1622631692.110993,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631692.1110253,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631693.9256918,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631693.9257143,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631695.0097864,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631695.009808,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631695.2612553,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631695.2612772,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631696.0857518,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631696.0857756,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631702.0451055,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}
{"level":"info","ts":1622631702.045128,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-pcbfb-2404255012"}

Next run

{"level":"info","ts":1622631768.5726974,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631768.572721,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631768.5772638,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631768.5772843,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631768.6517446,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631768.6517673,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631769.8138227,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631769.8138506,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631770.8717377,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631770.8717608,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631771.2016962,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631771.2017179,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631771.949174,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631771.9491968,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631778.463374,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-mvkk7-3840758880"}
{"level":"info","ts":1622631778.4633968,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-mvkk7-3840758880"}
{"level":"info","ts":1622631778.4745355,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-mvkk7-3827168991"}
{"level":"info","ts":1622631778.4745579,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-mvkk7-3827168991"}
{"level":"info","ts":1622631778.4745853,"logger":"kfp-cache-controller","msg":"Pod does not exist","pod":"kubeflow-user/addition-pipeline-mvkk7-3840758880"}
{"level":"info","ts":1622631778.4823828,"logger":"kfp-cache-controller","msg":"Pod does not exist","pod":"kubeflow-user/addition-pipeline-mvkk7-3827168991"}
{"level":"info","ts":1622631778.7625508,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631778.7625754,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631778.7830899,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631778.783113,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631778.7965403,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631778.7965593,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-687887970"}
{"level":"info","ts":1622631778.8399537,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631778.8399792,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631780.6956744,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631780.695712,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631781.8212683,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631781.8212938,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631782.109519,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631782.1095474,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631782.8937457,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631782.893769,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631788.7896814,"logger":"kfp-cache-controller","msg":"Successfully retrieved pod","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
{"level":"info","ts":1622631788.7897058,"logger":"kfp-cache-controller","msg":"Pod is not a Kale step","pod":"kubeflow-user/addition-pipeline-wshhv-1927489725"}
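Logs like these are easier to read once tallied by message. The sketch below is one possible way to do that with the standard library; `iter_log_entries` and `count_messages` are hypothetical helper names, and the sample is a two-entry excerpt from the first run above. It relies on `json.JSONDecoder.raw_decode`, so it should accept entries that are either one per line or concatenated on a single line, with or without a leading label such as "First run".

```python
import json
from collections import Counter


def iter_log_entries(text):
    """Yield each JSON object found in text, skipping anything between them."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        # raw_decode returns the parsed object and the index where it ended.
        entry, pos = decoder.raw_decode(text, start)
        yield entry


def count_messages(text):
    """Count how often each "msg" value appears in the log text."""
    return Counter(entry["msg"] for entry in iter_log_entries(text))


# Two entries copied from the first run above, kept on one line on purpose.
sample = (
    'First run {"level":"info","ts":1622631681.9666345,'
    '"logger":"kfp-cache-controller","msg":"Successfully retrieved pod",'
    '"pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"} '
    '{"level":"info","ts":1622631681.9666553,'
    '"logger":"kfp-cache-controller","msg":"Pod is not a Kale step",'
    '"pod":"kubeflow-user/addition-pipeline-pcbfb-2880583143"}'
)

for message, count in count_messages(sample).items():
    print(f"{count:3d}  {message}")
```

Across both runs the only messages are "Successfully retrieved pod", "Pod is not a Kale step", and "Pod does not exist"; none of them reports a cache lookup or a cache hit.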

jbottum commented 3 years ago

/kind question
/priority p2
/area pipelines

google-oss-robot commented 3 years ago

@jbottum: The label(s) area/pipeliines cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubeflow/pipelines/issues/5764#issuecomment-859037860):

>/kind question
>/priority p2
>/area pipeliines

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Bobgy commented 3 years ago

It seems the KFP cache deployer is missing in MiniKF. @yanniszark @elikatsis, who is the best person to ask about MiniKF?

elikatsis commented 3 years ago

Hi all!

It's true, we don't deploy the official KFP cache in MiniKF, for a few reasons:

  1. We are running into issues related to https://github.com/kubeflow/pipelines/issues/5257 because every Kale pipeline starts with a VolumeOp creating a PVC with the notebook environment & data. Thus, we cannot have this step cached because no PVC will get created.
  2. As mentioned in the aforementioned issue, currently the official cache is either enabled or disabled and doesn't permit excluding specific steps or components. This would solve (1). We already have some proposals (see https://github.com/kubeflow/pipelines/issues/5257#issuecomment-852326538) and we will elaborate on them since we want to make the KFP cache available on MiniKF as well.
  3. Our current approach for caching KFP steps relies exclusively on MLMD, without an extra DB. We believe this is the way to go. This also seems to be in line with KFPv2 making MLMD the backbone of data passing and artifact logging.

By the way, our caching mechanism is deployed in the kubeflow namespace as the kfp-cache deployment, and that's what the logs above are about.

cc @StefanoFioravanzo

Bobgy commented 3 years ago

@elikatsis shall we document this on the MiniKF side? Can we close the issue now?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rimolive commented 4 months ago

Closing this issue, no activity for more than a year. If this issue persists in the latest release, please open a new issue.

/close

google-oss-prow[bot] commented 4 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/5764#issuecomment-2016831910):

>Closing this issue, no activity for more than a year. If this issue persists in the latest release, please open a new issue.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.