kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Intermittent Failures of GPU-Enabled KFP Tasks with Exit Code 255 in Init Phase #10379

Closed · tom-pavz closed this issue 6 months ago

tom-pavz commented 9 months ago

Environment

Steps to reproduce

We have found it to be perhaps more likely to happen when more GPU tasks/nodes are running concurrently in the cluster. EDIT: This no longer seems to be true; we have now observed the exit code 255 behavior with only one GPU node and one GPU-enabled task running on the entire cluster. Most likely the failure is simply intermittent, and it shows up more often when more GPU tasks are running only because, with more tasks, it is more likely that at least one of them hits the issue.
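
To narrow down where the 255 comes from, we have been inspecting the init container statuses of the task pods directly. A minimal sketch, assuming the kubernetes Python client, a kubeconfig with cluster access, and that the pipeline pods run in the kubeflow namespace with the usual Argo Workflows label (the namespace and label selector are assumptions; adjust for your deployment):

```python
# Sketch: list KFP task pods whose init containers terminated with a non-zero
# exit code, so the failing container and its reason can be seen at a glance.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# KFP v1 task pods are created by Argo Workflows, so filter on its workflow label.
pods = core.list_namespaced_pod(
    namespace="kubeflow",
    label_selector="workflows.argoproj.io/workflow",
)

for pod in pods.items:
    for status in pod.status.init_container_statuses or []:
        terminated = status.state.terminated
        if terminated and terminated.exit_code != 0:
            print(pod.metadata.name, status.name, terminated.exit_code, terminated.reason)
```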

Expected result

These GPU KFP tasks should never encounter this issue; they should always succeed (or at least fail for an understandable reason rooted in the task's own source code).

Materials and Reference


Impacted by this bug? Give it a 👍.

zijianjoy commented 9 months ago

Would you like to consider upgrading the KFP version to the latest and trying again? Currently you are using the v2 alpha.

tom-pavz commented 9 months ago

> Would you like to consider upgrading the KFP version to the latest and trying again? Currently you are using the v2 alpha.

@zijianjoy Thank you for your reply!

I am under the impression that KFP v2 is "backward compatible" with v1. Since we use version 1.8.22 of the kfp Python SDK, we are effectively still using KFP v1, which is not an unstable release. Please let me know if I am misunderstanding this.

Also, we are on Kubeflow 1.7, as that is the most recent published release in the AWS Labs kubeflow-manifests repo (https://github.com/awslabs/kubeflow-manifests/releases), and even in the first-party kubeflow/manifests repo, Kubeflow 1.7 ships the v2 alpha KFP version: https://github.com/kubeflow/manifests/tree/v1.7.0.

So overall, I didn't think it would be easy or safe to simply "upgrade the KFP version," given the factors above. Please let me know if I am misunderstanding any of this.

zijianjoy commented 9 months ago

KFP v2 is already GA. The latest Kubeflow release, 1.8, is already using it. Please contact AWS in order to obtain a newer version of the AWS distribution.

tom-pavz commented 9 months ago

> KFP v2 is already GA. The latest Kubeflow release, 1.8, is already using it. Please contact AWS in order to obtain a newer version of the AWS distribution.

@zijianjoy AWS is still undecided on whether they will create a new distribution for Kubeflow 1.8: https://github.com/awslabs/kubeflow-manifests/issues/794.

Also, I am unconvinced this would even resolve our issue. We are still using KFP SDK 1.8.22, so I don't see how bumping to a newer 2.x.x server version would help. It doesn't seem like other KFP 1.8.x users are encountering this issue, so I was hoping for some help resolving it in our current deployment.

Also, even in the newest version of KFP v2, platform-specific features such as creating PVCs on Kubernetes are still buggy right now, and there is no way to set labels or tolerations on pods in v2 pipelines, which we need in order to isolate our KFP task pods onto our Karpenter-autoscaled EC2 instances.
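
For context, this is roughly how we do that isolation with the v1 SDK today. A minimal sketch assuming kfp==1.8.22 and the kubernetes Python client; the component, label keys, and node-pool values are placeholders rather than our real configuration:

```python
# Sketch: pin a GPU task's pod onto dedicated, Karpenter-provisioned nodes using
# v1 SDK ContainerOp settings (pod label, node selector, toleration, GPU limit).
import kfp
from kfp import dsl
from kfp.components import create_component_from_func
from kubernetes.client import V1Toleration


def train() -> None:
    print("training on GPU")


train_op = create_component_from_func(train, base_image="python:3.9")


@dsl.pipeline(name="gpu-isolation-example")
def gpu_pipeline():
    task = train_op()
    task.set_gpu_limit(1)
    # Placeholder label/selector values; substitute the labels your node pools use.
    task.add_pod_label("workload-type", "gpu-training")
    task.add_node_selector_constraint("karpenter.sh/provisioner-name", "gpu")
    task.add_toleration(
        V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule"))


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(gpu_pipeline, "gpu_pipeline.yaml")
```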

zijianjoy commented 9 months ago

Understood about the situation. However, we are currently focusing on supporting v2, so I will keep this issue open and lean on the community to chime in with help.

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 6 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.