kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

the object has been modified; please apply your changes to the latest version and try again #607

Open gl-001 opened 7 months ago

gl-001 commented 7 months ago

Version 0.4.0. Has anyone else run into this problem? I added some logging and found that the doUpdateJobStatus function returns "the object has been modified; please apply your changes to the latest version and try again". Thanks.

tenzen-y commented 7 months ago

"the object has been modified; please apply your changes to the latest version and try again"

This is a well-known client-side apply issue. However, the error by itself does not cause any bugs.
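
For context, the API server returns this message when an update is based on an older `resourceVersion` than the one currently stored. The following is a minimal sketch of that check, not the mpi-operator's code: the `store` and `object` types are hypothetical in-memory stand-ins for the API server and a stored resource.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's conflict response.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// object is a hypothetical, minimal stand-in for a stored resource such as an MPIJob.
type object struct {
	resourceVersion int
	phase           string
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct{ current object }

// get returns a copy of the stored object.
func (s *store) get() object { return s.current }

// update accepts the write only if the caller's copy is based on the latest version.
func (s *store) update(o object) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.current = o
	return nil
}

func main() {
	s := &store{current: object{resourceVersion: 1, phase: "Created"}}

	// The controller reads the object at version 1.
	stale := s.get()

	// Another writer updates the object first, bumping the stored version to 2.
	other := s.get()
	other.phase = "Running"
	_ = s.update(other)

	// The controller's write is still based on version 1, so it is rejected.
	stale.phase = "Failed"
	fmt.Println(s.update(stale))
}
```

The rejected write is harmless only if the writer later re-reads the object and tries again.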

/close

google-oss-prow[bot] commented 7 months ago

@tenzen-y: Closing this issue.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/607#issuecomment-1827619120):

>> "the object has been modified; please apply your changes to the latest version and try again"
>
>This is a well-known client-side apply issue. However, this error doesn't raise any bugs.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

gl-001 commented 7 months ago

This error makes the job status unreliable: the job stays in the Running state for a long time even though the pods have already succeeded. If another job follows the MPIJob in a pipeline, that job is still not processed after a long wait. Is there any way to solve this problem? @tenzen-y


Completion Time: 2023-11-27T03:34:09Z
Conditions:
  Last Transition Time: 2023-11-27T03:31:
  Last Update Time: 2023-11-27T03:31:20Z
  Message: MPIJob a5qvbedvqod1-mpijob is created.
  Reason: MPIJobCreated
  Status: True
  Type: Created
  Last Transition Time: 2023-11-27T03:34:09Z
  Last Update Time: 2023-11-27T03:34:09Z
  Message: Job has reached the specified backoff limit
  Reason: BackoffLimitExceeded
  Status: True
  Type: Failed
  Last Transition Time: 2023-11-27T03:34:09Z
  Last Update Time: 2023-11-27T03:34:09Z
  Message: MPIJob a5qvbedvqod1-mpijob is running.
  Reason: MPIJobRunning
  Status: True
  Type: Running [stays like this for a long time]
Replica Statuses:
  Launcher:
    Failed: 1
  Worker:
Start Time: 2023-11-27T03:31:20Z

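
Read literally, the status above has both a Failed and a Running condition set to True. As a rough sketch of how a downstream pipeline step might avoid waiting on the stale Running condition, the hypothetical helper `isFinished` below treats any terminal condition as decisive. The `condition` type and the `Succeeded` condition name are assumptions for illustration; only `Created`, `Failed`, and `Running` appear in the output above.

```go
package main

import "fmt"

// condition mirrors only the two fields needed from the status output above.
type condition struct {
	Type   string
	Status string
}

// isFinished is a hypothetical helper: it reports whether any terminal
// condition (assumed to be "Succeeded" or "Failed") is True, regardless of
// whether a Running condition is also still True.
func isFinished(conds []condition) bool {
	for _, c := range conds {
		if (c.Type == "Succeeded" || c.Type == "Failed") && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	// The three conditions reported in the status above.
	conds := []condition{
		{Type: "Created", Status: "True"},
		{Type: "Failed", Status: "True"},
		{Type: "Running", Status: "True"},
	}
	fmt.Println(isFinished(conds))
}
```

This only works around the symptom on the consumer side; it does not explain why the Running condition was never cleared.
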
tenzen-y commented 6 months ago

/reopen

google-oss-prow[bot] commented 6 months ago

@tenzen-y: Reopened this issue.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/607#issuecomment-1869995577):

>/reopen

tenzen-y commented 6 months ago

/kind support

google-oss-prow[bot] commented 6 months ago

@tenzen-y: The label(s) kind/support cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/607#issuecomment-1869995699):

>/kind support

tenzen-y commented 6 months ago

> This error makes the job status unreliable: the job stays in the Running state for a long time even though the pods have already succeeded. If another job follows the MPIJob in a pipeline, that job is still not processed after a long wait. Is there any way to solve this problem? @tenzen-y


@gl-001 Sorry for the late response. IIUC, if the status update fails, the controller will retry updating the MPIJob. Can you share the mpi-operator logs with us?
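
The retry behavior described above can be sketched as a read-modify-write loop that re-reads the object after each conflict. This is an illustration under stated assumptions, not the operator's implementation: `store`, `job`, and `updateWithRetry` are hypothetical names, and the thread does not say which retry mechanism the controller actually uses. With client-go, `retry.RetryOnConflict` from `k8s.io/client-go/util/retry` serves a similar purpose.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's conflict response.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// job is a hypothetical, minimal stand-in for an MPIJob and its status.
type job struct {
	resourceVersion int
	conditions      []string
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct{ current job }

// get returns a copy of the stored job, including its own conditions slice.
func (s *store) get() job {
	j := s.current
	j.conditions = append([]string(nil), s.current.conditions...)
	return j
}

// update rejects writes that are not based on the latest resourceVersion.
func (s *store) update(j job) error {
	if j.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	j.resourceVersion++
	s.current = j
	return nil
}

// updateWithRetry re-reads the latest job and reapplies mutate after each
// conflict, giving up after the given number of attempts.
func updateWithRetry(s *store, attempts int, mutate func(*job)) error {
	for i := 0; i < attempts; i++ {
		latest := s.get()
		mutate(&latest)
		err := s.update(latest)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, errConflict)
}

func main() {
	s := &store{current: job{resourceVersion: 1, conditions: []string{"Created"}}}
	interfered := false
	err := updateWithRetry(s, 3, func(j *job) {
		if !interfered {
			// Simulate a concurrent writer landing between our read and our write.
			interfered = true
			other := s.get()
			other.conditions = append(other.conditions, "Running")
			_ = s.update(other)
		}
		j.conditions = append(j.conditions, "Failed")
	})
	fmt.Println(err, s.get().conditions)
}
```

In this sketch the first attempt is rejected and the second succeeds on top of the concurrent writer's change. If a status stays stale indefinitely, the logs should show whether such a retry ever ran.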