emeraldbay opened this issue 3 months ago
Thank you for creating this @emeraldbay!
> Better job restart trigger:
Since the pod failure policy might not work and we require an additional node watcher to detect GPU issues, do we want to implement this feature at the TrainJob or JobSet level, @tenzen-y?
> Add training job max retry count support. If a training job exceeds the max retry count, the training job will be deleted.
This will be supported in V2 APIs: https://github.com/kubeflow/training-operator/pull/2171
> The current DeletePod implementation does not do force pod deletion. Add a -force option for pod deletion that overrides the default 30s grace period. The default 30s grace period causes unnecessary delays to restarts, and some pods might get stuck in Terminating status.
@tenzen-y Is there a way to configure the grace period for pods on a batch Job?
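For reference (not a maintainer answer): on a plain batch/v1 Job the grace period is a pod-template field, so it can be set via `spec.template.spec.terminationGracePeriodSeconds`. A minimal Go sketch; the image name and the 5s value are placeholders:

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The grace period lives on the pod template, not on the Job itself.
	grace := int64(5) // overrides the 30s default
	job := batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "trainer"},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					TerminationGracePeriodSeconds: &grace,
					RestartPolicy:                 corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "trainer", Image: "pytorch/pytorch"},
					},
				},
			},
		},
	}
	fmt.Println(*job.Spec.Template.Spec.TerminationGracePeriodSeconds)
}
```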
Thanks. Could you please provide some context about the difference between Kubeflow training operator v2 and JobSet? Is JobSet expected to eventually replace the Kubeflow training operator in terms of training job submission?
> Thanks. Could you please provide some context about the difference between Kubeflow training operator v2 and JobSet? Is JobSet expected to eventually replace the Kubeflow training operator in terms of training job submission?
The training job submission will still be via Training Operator: https://github.com/kubeflow/training-operator/blob/1f336d01af2c1e305bd6e660e079ffea107a51a9/docs/proposals/2170-kubeflow-training-v2/README.md#user-roles-diagram.
TrainJob will just create an appropriate JobSet and additional resources (e.g. a hostfile for MPI) to orchestrate resources for model training.
Thanks. @tenzen-y Could you please help comment on the questions above?
Any update on this?
> Since the pod failure policy might not work and we require an additional node watcher to detect GPU issues, do we want to implement this feature at the TrainJob or JobSet level, @tenzen-y?
Sorry, I could not understand the reason for this. Why can the pod failure policy not detect Node problems?
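For context on what the pod failure policy can see: its rules only match pods that have already failed, keyed on container exit codes or pod conditions such as `DisruptionTarget`; a pod that hangs without exiting never matches any rule. A hedged Go sketch of such a policy using the standard batch/v1 types (the rule set itself is illustrative):

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func main() {
	policy := &batchv1.PodFailurePolicy{
		Rules: []batchv1.PodFailurePolicyRule{
			{
				// Pods evicted due to node disruption carry the DisruptionTarget
				// condition; ignore them so they don't count toward backoffLimit.
				Action: batchv1.PodFailurePolicyActionIgnore,
				OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
					{Type: corev1.DisruptionTarget, Status: corev1.ConditionTrue},
				},
			},
			{
				// Any non-zero exit code fails the whole Job immediately.
				Action: batchv1.PodFailurePolicyActionFailJob,
				OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
					Operator: batchv1.PodFailurePolicyOnExitCodesOpNotIn,
					Values:   []int32{0},
				},
			},
		},
	}
	fmt.Printf("%+v\n", policy)
}
```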
When a GPU failure happens, the training job might just get stuck and the pod does not exit with a failure.
> When a GPU failure happens, the training job might just get stuck and the pod does not exit with a failure.
As far as I know, that does not happen. If you are facing that problem, it is a bug in the kubelet or the device plugin. I would recommend reporting it to SIG Node.
One example is NCCL communication getting stuck due to a GPU failure; eventually it will time out and the kubelet will react to that, but that can be a long wait depending on the NCCL timeout config. We have better signals at the node level and we want the capability to act on them.
The Nvidia device plugin might report a missing GPU in some cases, but in general we have seen that it does not cover all failure patterns.
@emeraldbay For the NCCL communication error, don't you want to integrate custom error handlers into your PyTorch code in case of a timeout?
> The Nvidia device plugin might report a missing GPU in some cases, but in general we have seen that it does not cover all failure patterns.
Do you know how the device plugin detects such missing GPUs and how it reports the results?
@andreyvelich In short, we want fast recovery before the NCCL timeout, instead of using the NCCL timeout to trigger our recovery/error handler. This is mainly because we saw that other signals could tell us there are GPU failures.
The device plugin checks a subset of driver error logs and will change the available GPU device count if it detects a failure, e.g. `Updated allocatable device="nvidia.com/gpu" allocatable=X`. Overall, the kubelet and the Nvidia device plugin do not offer what we need right now.
For this issue, I am mainly trying to understand whether you are open to enhancing the job restart logic to consider node/GPU failures. If you think the kubelet and the Nvidia device plugin are responsible for detection, and pod failure should be the only trigger of a KTO job restart on node/GPU failures, please let us know. Thanks.
> This is mainly because we saw that other signals could tell us there are GPU failures.
What kind of signals do you monitor to detect such failures? Do you track GPU utilization via the Nvidia DCGM exporter or something else?
> If you think the kubelet and the Nvidia device plugin are responsible for detection, and pod failure should be the only trigger of a KTO job restart on node/GPU failures, please let us know. Thanks.
@kubeflow/wg-training-leads Any thoughts on this? Should we detect such use-cases during the Training Operator orchestration logic?
> What kind of signals do you monitor to detect such failures? Do you track GPU utilization via the Nvidia DCGM exporter or something else?
We have our own fault detection mechanism, which runs a DaemonSet to do continuous monitoring.
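Not part of the thread itself, but to make this concrete: a hedged sketch of how such a DaemonSet detector could surface its finding to the cluster by labeling its node. The `example.com/gpu-healthy` label key and the `NODE_NAME` env var (injected via the downward API) are assumptions for illustration:

```go
package main

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// markNodeUnhealthy labels the local node when the detector finds a GPU fault,
// so a controller-side watcher can react without waiting for the pod to fail.
// The label key is hypothetical.
func markNodeUnhealthy(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"metadata":{"labels":{"example.com/gpu-healthy":"false"}}}`)
	_, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig() // the detector runs as a DaemonSet pod
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// NODE_NAME is assumed to be injected via the downward API.
	if err := markNodeUnhealthy(context.Background(), cs, os.Getenv("NODE_NAME")); err != nil {
		panic(err)
	}
}
```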
What would you like to be added?
Description
We are proposing changes to enhance training job restarts that can help avoid restart failures and delays in the case of GPU instance / K8s node failures:
- Better job restart trigger: add a K8s node watcher that watches for node condition and label changes, so that the Kubeflow training operator can trigger a training job restart based on a NodeCondition or NodeLabel change.
- Add training job max retry count support. If a training job exceeds the max retry count, the training job will be deleted.
- Enforce force deletion of pods during restart.
Why is this needed?
As mentioned in issue #2072, failed K8s nodes currently leave jobs hanging indefinitely. The planned solution is adding Pod Failure Policy and Pod Disruption Condition support. But when a training job hits a GPU failure, the training job might get stuck and the Pod may not exit with a failure status. We need better integration with K8s node fault detection or Nvidia GPU fault detection mechanisms; for example, Node Problem Detector uses NodeCondition to report problems to the apiserver. We want to add a K8s node watcher that keeps monitoring NodeCondition and NodeLabel changes and triggers a training job restart (e.g. deletes all the pods related to the training job).
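A minimal client-go sketch of the kind of node watcher described here; the `example.com/gpu-healthy` label, the unhealthiness check, and the reaction are illustrative assumptions, not a proposed implementation:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

// nodeLooksUnhealthy is a placeholder check: NotReady, or a hypothetical
// label set by an external GPU fault detector.
func nodeLooksUnhealthy(node *corev1.Node) bool {
	if node.Labels["example.com/gpu-healthy"] == "false" {
		return true
	}
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			node := newObj.(*corev1.Node)
			if nodeLooksUnhealthy(node) {
				// Here the operator would look up training jobs with pods on
				// this node and delete those pods to trigger a restart.
				fmt.Printf("node %s looks unhealthy, would restart affected jobs\n", node.Name)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block forever
}
```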
The current training operator restart policy does not support a max retry count: if we set restartPolicy to restart on failure, the job enters infinite retries, which means a failed training job will occupy resources indefinitely. We want to add a max retry count option.
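For comparison, the closest existing analog in batch/v1 is the Job's `backoffLimit`, which caps retries but marks the Job Failed rather than deleting it. A minimal sketch using the standard batch/v1 types (names and values are placeholders):

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func main() {
	// backoffLimit is the batch/v1 analog of the requested max retry count:
	// once failed retries exceed it, the Job is marked Failed and stops
	// consuming resources (it is not deleted automatically, though).
	limit := int32(3)
	spec := batchv1.JobSpec{
		BackoffLimit: &limit,
		Template: corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
				Containers:    []corev1.Container{{Name: "trainer", Image: "pytorch/pytorch"}},
			},
		},
	}
	fmt.Println(*spec.BackoffLimit)
}
```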
The current DeletePod implementation does not do force pod deletion. Add a -force option for pod deletion that overrides the default 30s grace period. The default 30s grace period causes unnecessary delays to restarts, and some pods might get stuck in Terminating status. We want to add a force pod deletion option.
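A hedged sketch of what such a force deletion could look like via client-go, roughly the API equivalent of `kubectl delete pod --grace-period=0 --force`; the namespace, pod name, and kubeconfig handling are placeholders:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// forceDeletePod deletes a pod with a zero grace period so a restart is not
// blocked by the default 30s termination grace period or by a pod stuck in
// Terminating.
func forceDeletePod(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	zero := int64(0)
	return cs.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Pod name and namespace are placeholders.
	if err := forceDeletePod(context.Background(), cs, "default", "trainer-worker-0"); err != nil {
		panic(err)
	}
}
```

Note that a zero grace period removes the pod object without waiting for kubelet confirmation, so it is best reserved for nodes already known to be unhealthy.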
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.