kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

MPIJob gets stuck if LastReconcileTime is updated within 1 second #2118

Open shadowdsp opened 1 month ago

shadowdsp commented 1 month ago

My MPIJob gets stuck forever when SyncPodGroup returns errors twice within the same second.

For example:

  1. At 00:00:00.100, SyncPodGroup creates the pod group, but the subsequent get of the pod group fails.
  2. At 00:00:00.200, SyncPodGroup tries to update the pod group but hits a conflict error, such as Operation cannot be fulfilled on ...
    1. The controller then sets LastReconcileTime to the same value as in step 1.
    2. Finally, the controller calls UpdateJobStatusInApiServer, but because the job spec has not changed, this does not trigger the next reconcile.
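
The sequence above can be sketched in isolation. This is a minimal illustration, not the controller's actual code: it assumes LastReconcileTime is stored with second precision (RFC 3339 without fractional seconds), and the helpers `at`, `serialize`, and `statusChanged` as well as the calendar date are hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// at returns a hypothetical reconcile timestamp: midnight UTC on an
// arbitrary date plus ms milliseconds.
func at(ms int) time.Time {
	base := time.Date(2024, time.January, 1, 0, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(ms) * time.Millisecond)
}

// serialize mimics a second-precision timestamp encoding
// (assumption: the stored status drops fractional seconds).
func serialize(t time.Time) string {
	return t.UTC().Format(time.RFC3339)
}

// statusChanged reports whether overwriting prev with next would alter
// the stored value, and therefore whether an update event would follow.
func statusChanged(prev, next time.Time) bool {
	return serialize(prev) != serialize(next)
}

func main() {
	// Step 1 at 00:00:00.100 and step 2 at 00:00:00.200 fall in the same second.
	fmt.Println(statusChanged(at(100), at(200)))
	// A reconcile one second later would produce a different stored value.
	fmt.Println(statusChanged(at(100), at(1100)))
}
```

Under that assumption, the two reconciles at .100 and .200 produce identical stored timestamps, so the status write is effectively a no-op and nothing re-queues the job.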