[backend] Intermittent Failures of GPU-Enabled KFP Tasks with Exit Code 255 in Init Phase

tom-pavz commented 9 months ago

Environment

How did you deploy Kubeflow Pipelines (KFP)? We used the AWS Labs provided kubeflow-manifests rds-s3 terraform full deployment option v1.7.0-aws-b1.0.3 https://awslabs.github.io/kubeflow-manifests/release-v1.7.0-aws-b1.0.3/docs/deployment/rds-s3/guide-terraform/
KFP version: 2.0.0-alpha.7
KFP SDK version: 1.8.22

Steps to reproduce

Installed the nvidia gpu-operator v23.9.1 https://github.com/NVIDIA/gpu-operator/tree/v23.9.1 by setting the following in AWS Labs kubeflow-manifests https://github.com/awslabs/kubeflow-manifests/blob/v1.7.0-aws-b1.0.3/deployments/rds-s3/terraform/main.tf#L218
Installed karpenter for autoscaling of GPU nodes using helm v0.32.1 https://karpenter.sh/v0.32/getting-started/migrating-from-cas/#deploy-karpenter
Run a KFP that has task(s) that are configured to run on GPU using:
```
task.set_gpu_limit("1")
```
These tasks will intermittently fail with This step is in Error state with this message: Unknown (exit code 255). Most of the time they are successful.

We have found it to be perhaps more likely to happen when there are more concurrent GPU tasks/nodes running in the cluster at once. EDIT: This seems to be not true, we have now observed this exit code 255 behavior when there is only one GPU node and GPU-enabled task running on the entire cluster. Probably this is just intermittent and more likely to happen the more GPU tasks are running simply because there are more of them so at least one of them getting this issue is more likely at that point.

Expected result

These GPU KFP tasks should never encounter this issue and instead, always succeed (or at least fail with an understandable reason that is based upon a bug in the source code of the task itself).

Materials and Reference

It looks like (at least the one time that we closely monitored this as it happened) this init container failed with 255 for an unknown reason gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance with no error logs in the logs of that container, just normal info logs downloading the input s3 artifacts. It seems like this container isn't actually what is causing the failure.
Hitting retry on these pipelines/tasks sees them succeed, as the issue is intermittent. In fact, other tasks in the pipeline can get scheduled on the same node after this first task on it fails and actually run to completion successfully, so it doesn't seem that the entire node is broken for any reason.
We have only ever seen this issue on tasks that are requesting to be run on a GPU-enabled EC2 (g4dn). The AMI we are using is the Amazon EKS optimized accelerated Amazon Linux 2 AMI, example: amazon-eks-gpu-node-1.25-v20231230
Setting a task.set_retry(num_retries) does not work, because the task's proper pipeline task container never actually gets started as the pod fails in the init stage and so the KF retries never come into play.
Setting the k8s restartPolicy on the KFP pod itself to onFailure is not an option because:
1. KFP SDK does not provide an interface to do this
2. This would cause all pipelines that have legitimate failures (say a bug in the src code of the task) to indefinitely retry over and over again and never actually fail the pipeline as it should.

Impacted by this bug? Give it a 👍.

zijianjoy commented 9 months ago

Would you like to consider upgrading KFP version to the latest and try again? Currently you are using v2 alpha

tom-pavz commented 9 months ago

Would you like to consider upgrading KFP version to the latest and try again? Currently you are using v2 alpha

@zijianjoy Thank you for your reply!

I am under the impression that KFP is "backward compatible" from V2 to V1. Because we use the 1.8.22 version of the kfp python sdk, we are effectively still using V1 of KFP which would not be an unstable release. Please let me know if I am misunderstanding this.

Also, we are on KubeFlow version 1.7 as this is the most recent published release in the AWS Labs kubeflow-manifests repo: https://github.com/awslabs/kubeflow-manifests/releases, and even in the 1st party kubeflow/manifests repo, KF 1.7 has the V2 alpha KFP version: https://github.com/kubeflow/manifests/tree/v1.7.0.

So overall, I didn't think it was going to be very easy or safe to just "upgrade the KFP version" because of the factors I mentioned above. Please let me know if I am misunderstanding any of this.

zijianjoy commented 9 months ago

KFP v2 is already GAed. The latest Kubeflow version 1.8 is already using it. Please contact AWS in order to obtain a newer version of AWS distribution.

tom-pavz commented 9 months ago

KFP v2 is already GAed. The latest Kubeflow version 1.8 is already using it. Please contact AWS in order to obtain a newer version of AWS distribution.

@zijianjoy AWS is still undecided if they will create a new distro for KF 1.8 https://github.com/awslabs/kubeflow-manifests/issues/794.

Also, I am unconvinced this would even resolve our issue as we are still using KFP sdk 1.8.22, and so I don't see how bumping to a newer 2.x.x server version would help that? It doesn't seem to me like other KFP 1.8.x users are encountering this issue, so I was looking for some help on how to resolve it in our current deployment.

Also, even in the newest version of KFP V2 platform-specific things such as creating PVC on k8s, etc are all buggy still right now, and there is also no label setting or tolerance setting for the pods in v2 pipelines, which we need to isolate our KFP task pods onto our karpenter autoscaled EC2's.

zijianjoy commented 9 months ago

Understood about the situation. However, we currently have been focusing on supporting v2. Therefore, I will keep this issue open and lean on community to chime in for help.

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 6 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

kubeflow / pipelines