kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[sdk] enable_caching breaks when using CreatePVC: must specify FingerPrint #10188

Open TobiasGoerke opened 1 year ago

TobiasGoerke commented 1 year ago

Environment

Steps to reproduce

Given the following example:

from kfp import dsl
from kfp import kubernetes

@dsl.component
def test_step():
    print("Hello world")

@dsl.pipeline
def test_pipeline():
    kubernetes.CreatePVC(
        access_modes=["ReadWriteOnce"],
        size="10Mi",
        storage_class_name="default",
    )
    test_step()

# client is a pre-configured kfp.Client instance
client.create_run_from_pipeline_func(test_pipeline, arguments={}, enable_caching=False)

The pipeline will fail. Note the enable_caching argument: the issue occurs when it is set to False.

We will see an error in the created PVC step:

F1031 14:29:54.216337 27 main.go:76] KFP driver: driver.Container(pipelineName=test-pipeline, runID=02ad61d6-8b9b-47a7-b626-0d65f3838b42, task="createpvc", component="comp-createpvc", dagExecutionID=9094, componentSpec) failed: failed to create PVC and publish execution createpvc: failed to create cache entrty for create pvc: failed to create task: rpc error: code = InvalidArgument desc = Failed to create a new task due to validation error: Invalid input error: Invalid task: must specify FingerPrint
time="2023-10-31T14:29:54.940Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1
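One plausible reading of the error above: when caching is disabled, the driver never computes a cache fingerprint, yet the CreatePVC code path still tries to create a cache entry unconditionally, which the API server then rejects. The following toy Python model sketches that interaction; it is not the actual Go driver code, and all function names here are illustrative:

```python
import hashlib
import json

def compute_fingerprint(component_spec: dict) -> str:
    # Toy sketch of a cache fingerprint: a stable hash of the component spec.
    spec_json = json.dumps(component_spec, sort_keys=True)
    return hashlib.sha256(spec_json.encode()).hexdigest()

def create_cache_entry(task: dict) -> dict:
    # Mimics the server-side validation seen in the log:
    # a task without a fingerprint is rejected.
    if not task.get("fingerprint"):
        raise ValueError("Invalid task: must specify FingerPrint")
    return task

def run_createpvc(enable_caching: bool) -> dict:
    task = {"name": "createpvc"}
    if enable_caching:
        task["fingerprint"] = compute_fingerprint({"name": "comp-createpvc"})
    # Hypothesized bug pattern: the cache entry is created even when
    # caching is disabled and no fingerprint was computed.
    return create_cache_entry(task)
```

Under this model, run_createpvc(True) succeeds while run_createpvc(False) raises the same "must specify FingerPrint" validation error, matching the observed behavior.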

Impacted by this bug? Give it a 👍.

zijianjoy commented 1 year ago

@TobiasGoerke what is the version of your KFP runtime? Maybe there is a bug when resolving the cache key in the PVC creation operation. cc @chensun to learn more.

TobiasGoerke commented 1 year ago

> @TobiasGoerke what is the version of your KFP runtime? Maybe there is a bug when resolving the cache key in the PVC creation operation. cc @chensun to learn more.

I'm on manifests/v1.8-branch, i.e. 2.0.3.

yingding commented 12 months ago

I am facing exactly the same issue, with the same output, on KFP backend 2.0.3 with a Kubeflow 1.8.0 manifests deployment. The PVC is created, but the component reports the error below in its logs and exits with an error.

F1117 21:35:33.015147      22 main.go:76] KFP driver: driver.Container(pipelineName=my-pipeline, runID=cd147529-1b6c-454b-b3e1-b2858ff98222, task="createpvc", component="comp-createpvc", dagExecutionID=29, componentSpec) failed: failed to create PVC and publish execution createpvc: failed to create cache entrty for create pvc: failed to create task: rpc error: code = InvalidArgument desc = Failed to create a new task due to validation error: Invalid input error: Invalid task: must specify FingerPrint
time="2023-11-17T21:35:33.321Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1
yingding commented 12 months ago

Just want to add some additional info: after experiencing this issue, the KFP backend didn't work anymore in my case. I had to restart all the deployments (kubectl -n kubeflow rollout restart deployments) to be able to run v2 pipelines again.

yingding commented 10 months ago

With api-server 2.0.5 and enable_caching=False, this issue still exists.

kabartay commented 9 months ago

> With api-server 2.0.5 and enable_caching=False, this issue still exists.

  • KFP Backend API-SERVER version: 2.0.5 (manifests v1.8 release modified)
  • KFP SDK version:
kfp                      2.4.0
kfp-kubernetes           1.0.0
kfp-pipeline-spec        0.2.2
kfp-server-api           2.0.5

@yingding is it working fine for you now?

yingding commented 9 months ago

@kabartay Unfortunately, this issue still exists, even with

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 6 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

AnnKatrinBecker commented 6 months ago

/reopen

Seems this issue has not been resolved, yet.

google-oss-prow[bot] commented 6 months ago

@AnnKatrinBecker: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubeflow/pipelines/issues/10188#issuecomment-2109523925):

> /reopen
>
> Seems this issue has not been resolved, yet.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

HumairAK commented 6 months ago

/reopen

google-oss-prow[bot] commented 6 months ago

@HumairAK: Reopened this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/10188#issuecomment-2110409080):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

HumairAK commented 4 months ago

/remove-lifecycle stale

haiminh2001 commented 2 months ago

Hi, what is the status of this issue? Has anyone solved it or found a workaround?

hbelmiro commented 1 month ago

/assign

Sunspirytus commented 2 weeks ago

This works for me (manifests 1.9.0):

  • Don't set the enable_caching parameter in create_run_from_pipeline_func, or set it to None.
  • Call set_caching_options(False) on each component instead.

from kfp import dsl
from kfp import kubernetes

@dsl.component
def make_data():
    with open('/data/file.txt', 'w') as f:
        f.write('my data')

@dsl.component
def read_data():
    with open('/reused_data/file.txt') as f:
        print(f.read())

@dsl.pipeline
def my_pipeline():
    pvc1 = kubernetes.CreatePVC(
        # can also use pvc_name instead of pvc_name_suffix to use a pre-existing PVC
        pvc_name_suffix='-my-pvc',
        access_modes=['ReadWriteMany'],
        size='5Gi',
        storage_class_name='nfs-client',
    )

    task1 = make_data()
    task1.set_caching_options(False)
    # normally task sequencing is handled by data exchange via component inputs/outputs
    # but since data is exchanged via volume, we need to call .after explicitly to sequence tasks
    task2 = read_data().after(task1)

    kubernetes.mount_pvc(
        task1,
        pvc_name=pvc1.outputs['name'],
        mount_path='/data',
    )
    kubernetes.mount_pvc(
        task2,
        pvc_name=pvc1.outputs['name'],
        mount_path='/reused_data',
    )

    # wait to delete the PVC until after task2 completes
    delete_pvc1 = kubernetes.DeletePVC(
        pvc_name=pvc1.outputs['name']).after(task2)

"""
# create your kfp_client
kfp_client.create_run_from_pipeline_func(my_pipeline, enable_caching=None, run_name='test-piepline', arguments={}, namespace='kubeflow-user-example-com')
"""

I found that if CreatePVC uses pvc_name instead of pvc_name_suffix, KFP reuses the previously cached createpvc node. I think that when enable_caching=False is passed to create_run_from_pipeline_func, a createpvc node that uses pvc_name may not work, whereas pvc_name_suffix should work, because it always creates a new PVC with a random name.
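That observation can be illustrated with a toy model (this is not the real KFP backend logic, and resolve_pvc_name is a hypothetical helper): a fixed pvc_name is stable across runs and can therefore match a cached entry, while pvc_name_suffix is combined with a fresh random prefix on every run, so the resulting name never matches anything in the cache:

```python
import uuid

def resolve_pvc_name(pvc_name=None, pvc_name_suffix=None) -> str:
    # Toy illustration: a fixed pvc_name yields the same value every run,
    # so any cache key derived from it can produce a hit; a pvc_name_suffix
    # gets a fresh random prefix per run, so the name is always unique.
    if pvc_name is not None:
        return pvc_name
    return uuid.uuid4().hex[:8] + pvc_name_suffix

# Fixed name: identical across runs, so cacheable.
# Suffix: unique per run, so never a cache hit.
```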