TobiasGoerke opened this issue 1 year ago
@TobiasGoerke what is the version of your KFP runtime? Maybe there is a bug when resolving the cache key in the PVC creation operation. cc @chensun to learn more.
I'm on manifests/v1.8-branch, i.e. 2.0.3.
I am also facing exactly the same issue, with the same output, on KFP backend 2.0.3 with a Kubeflow 1.8.0 manifests deployment.
The PVC is created, but the component reports the error below in its logs and exits with an error.
F1117 21:35:33.015147 22 main.go:76] KFP driver: driver.Container(pipelineName=my-pipeline, runID=cd147529-1b6c-454b-b3e1-b2858ff98222, task="createpvc", component="comp-createpvc", dagExecutionID=29, componentSpec) failed: failed to create PVC and publish execution createpvc: failed to create cache entrty for create pvc: failed to create task: rpc error: code = InvalidArgument desc = Failed to create a new task due to validation error: Invalid input error: Invalid task: must specify FingerPrint
time="2023-11-17T21:35:33.321Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1
Just to add some additional info: after hitting this issue, the KFP backend didn't work anymore in my case. I had to restart all the deployments (`kubectl -n kubeflow rollout restart deployments`) to be able to run v2 pipelines again.
With api-server 2.0.5 and `enable_caching=False`, this issue still exists.

- KFP Backend API-SERVER version: 2.0.5 (manifests v1.8 release, modified)
- KFP SDK version: kfp 2.4.0, kfp-kubernetes 1.0.0, kfp-pipeline-spec 0.2.2, kfp-server-api 2.0.5
@yingding is it finally working fine for you?
@kabartay Unfortunately, this issue still exists, even with:

- KFP Backend API-SERVER version: 2.0.5 (manifests v1.8 release, modified)
- KFP SDK version: kfp 2.6.0, kfp-kubernetes 1.1.0, kfp-pipeline-spec 0.3.0, kfp-server-api 2.0.5
Hopefully it can be resolved in the next KFP backend API server release.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
Seems this issue has not been resolved, yet.
@AnnKatrinBecker: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@HumairAK: Reopened this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Hi, what is the status of this issue? Has anyone solved it or found a workaround?
/assign
This works for me (manifests 1.9.0):

- Don't set the parameter `enable_caching` in `create_run_from_pipeline_func`, or set it to `None`.
- Call `set_caching_options(False)` on each component.
```python
from kfp import dsl
from kfp import kubernetes


@dsl.component
def make_data():
    with open('/data/file.txt', 'w') as f:
        f.write('my data')


@dsl.component
def read_data():
    with open('/reused_data/file.txt') as f:
        print(f.read())


@dsl.pipeline
def my_pipeline():
    pvc1 = kubernetes.CreatePVC(
        # can also use pvc_name instead of pvc_name_suffix to use a pre-existing PVC
        pvc_name_suffix='-my-pvc',
        access_modes=['ReadWriteMany'],
        size='5Gi',
        storage_class_name='nfs-client',
    )

    task1 = make_data()
    task1.set_caching_options(False)
    # normally task sequencing is handled by data exchange via component inputs/outputs,
    # but since data is exchanged via the volume, we need to call .after explicitly to sequence the tasks
    task2 = read_data().after(task1)
    task2.set_caching_options(False)

    kubernetes.mount_pvc(
        task1,
        pvc_name=pvc1.outputs['name'],
        mount_path='/data',
    )
    kubernetes.mount_pvc(
        task2,
        pvc_name=pvc1.outputs['name'],
        mount_path='/reused_data',
    )

    # wait to delete the PVC until after task2 completes
    delete_pvc1 = kubernetes.DeletePVC(
        pvc_name=pvc1.outputs['name']).after(task2)


"""
# create your kfp_client, then:
kfp_client.create_run_from_pipeline_func(my_pipeline, enable_caching=None, run_name='test-pipeline', arguments={}, namespace='kubeflow-user-example-com')
"""
```
I found that if `CreatePVC` uses `pvc_name` instead of `pvc_name_suffix`, KFP will reuse the cached createpvc node created before. I think that when `enable_caching=False` is set in `create_run_from_pipeline_func`, a createpvc node using `pvc_name` may not work, but using `pvc_name_suffix` should work because it always creates a new PVC with a random name?
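To make the two naming options concrete, a minimal sketch (not code from this thread; the storage class, size, and names are placeholders):

```python
from kfp import dsl
from kfp import kubernetes


@dsl.pipeline
def naming_options():
    # Fixed name: per the observation above, the createpvc task's inputs are identical
    # across runs, so the driver consults/writes the cache for it.
    fixed = kubernetes.CreatePVC(
        pvc_name='my-shared-pvc',
        access_modes=['ReadWriteMany'],
        size='5Gi',
        storage_class_name='standard',
    )

    # Suffix: KFP generates a unique PVC name per run, so a fresh PVC is always created.
    unique = kubernetes.CreatePVC(
        pvc_name_suffix='-my-pvc',
        access_modes=['ReadWriteMany'],
        size='5Gi',
        storage_class_name='standard',
    )
```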
Environment

- KFP version: 2.0.3 (manifests v1.8 release)

Steps to reproduce

Given the following example, the pipeline will fail. Note the `enable_caching` argument, which causes the issue when set to `False`. We will see an error in the created PVC step:
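The original example code is not reproduced above; a minimal sketch of the kind of pipeline that triggers this (the component body, names, and storage class are illustrative placeholders, not the author's original code):

```python
from kfp import dsl
from kfp import kubernetes


@dsl.component
def write_data():
    with open('/data/file.txt', 'w') as f:
        f.write('data')


@dsl.pipeline
def repro_pipeline():
    pvc = kubernetes.CreatePVC(
        pvc_name_suffix='-repro-pvc',
        access_modes=['ReadWriteMany'],
        size='1Gi',
        storage_class_name='standard',  # placeholder storage class
    )
    task = write_data()
    kubernetes.mount_pvc(task, pvc_name=pvc.outputs['name'], mount_path='/data')
    kubernetes.DeletePVC(pvc_name=pvc.outputs['name']).after(task)


# Disabling caching at run level is what surfaces the FingerPrint error in the createpvc driver.
# Client connection details (host, credentials, namespace) depend on your deployment.
# from kfp import Client
# Client().create_run_from_pipeline_func(repro_pipeline, arguments={}, enable_caching=False)
```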
Impacted by this bug? Give it a 👍.