Closed shaowei-su closed 3 years ago
Thanks for the issue, will fix it soon.
Did you set TTL for the job? Can you show the job status?
Can you please add my WeChat gaocedidi?
Maybe we can discuss it.
cc @ChanYiLin
@gaocegege Is there any more information for debugging?
I don't think it is because of the lack of ttl since we check if the ttl is nil or not.
The possible reason might be that CompletionTime is not set on the jobStatus.
OK, I might have found the bug:
https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L245
When the job exceeds a limit (BackoffLimit or ActiveDeadlineSeconds), it fails and starts to clean up its resources. The cleanup function uses jobStatus.CompletionTime, but jobStatus.CompletionTime is only set after the cleanup function has been called:
https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L260
This is a bug in Kubeflow/common repo.
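The ordering problem described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `JobStatus`, `cleanupAfterTTL`, `failJobBuggy`, and `failJobFixed` are hypothetical, simplified stand-ins, not the actual kubeflow/common types or functions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// JobStatus is a hypothetical, simplified stand-in for the real job status type.
type JobStatus struct {
	CompletionTime *time.Time
}

// cleanupAfterTTL is a hypothetical stand-in for the cleanup step. It needs
// CompletionTime to work out when the finished job may be deleted.
func cleanupAfterTTL(status *JobStatus, ttlSeconds int32) (time.Time, error) {
	if status.CompletionTime == nil {
		return time.Time{}, errors.New("cleanup: CompletionTime is nil")
	}
	return status.CompletionTime.Add(time.Duration(ttlSeconds) * time.Second), nil
}

// failJobBuggy mirrors the reported ordering: cleanup runs first and the
// completion time is only recorded afterwards.
func failJobBuggy(status *JobStatus, ttlSeconds int32) error {
	_, err := cleanupAfterTTL(status, ttlSeconds) // CompletionTime is still nil here
	now := time.Now()
	status.CompletionTime = &now // set too late for the cleanup above
	return err
}

// failJobFixed records the completion time before cleanup reads it.
func failJobFixed(status *JobStatus, ttlSeconds int32) error {
	if status.CompletionTime == nil {
		now := time.Now()
		status.CompletionTime = &now
	}
	_, err := cleanupAfterTTL(status, ttlSeconds)
	return err
}

func main() {
	fmt.Println("buggy:", failJobBuggy(&JobStatus{}, 60)) // buggy: cleanup: CompletionTime is nil
	fmt.Println("fixed:", failJobFixed(&JobStatus{}, 60)) // fixed: <nil>
}
```

The only difference between the two variants is whether the nil check and assignment happen before or after the cleanup call.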
@shaowei-su Does this happen when some jobs exceed their limit?
Thanks for the quick response! @gaocegege @ChanYiLin
Did you set TTL for the job? Can you show the job status? Does this happen when some jobs exceed their limit?
This error happens whenever a TFJob fails, and the status is:
status:
  conditions:
  - lastTransitionTime: "2021-01-23T06:58:46Z"
    lastUpdateTime: "2021-01-23T06:58:46Z"
    message: TFJob tfjob-manual-ttl-8 is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-23T06:58:49Z"
    lastUpdateTime: "2021-01-23T06:58:49Z"
    message: TFJob tfjob-manual-ttl-8 is running.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-01-23T07:02:07Z"
    lastUpdateTime: "2021-01-23T07:02:07Z"
    message: TFJob tfjob-manual-ttl-8 has failed because 1 Worker replica(s) failed.
    reason: TFJobFailed
    status: "True"
    type: Failed
  replicaStatuses:
    Worker:
      active: 1
      failed: 1
  startTime: "2021-01-23T06:58:46Z"
@gaocegege @ChanYiLin PTAL: https://github.com/kubeflow/common/pull/108 thanks!
We need to update the vendor to fix it in tf-operator. Can you open a PR for it?
It is fixed by #1225
Hi @shaowei-su, I wonder whether the PR in kubeflow/common actually fixes the issue. Your logs showed that the controller failed when it was trying to clean up, and the TFJob failed because a worker failed, not because a limit was exceeded. So I guess the cleanup function call comes from #L165, not from the limit-exceeded block. Can you check again whether the PR fixes your issue?
You are right, @ChanYiLin. I thought that in the case of Success or Failure the job status would already have CompletionTime set before being passed in, which is in fact not the case.
So I guess we need a code block like
if jobStatus.CompletionTime == nil {
	now := metav1.Now()
	jobStatus.CompletionTime = &now
}
in line #159 as well?
That will indeed fix the problem, but I am curious why the TFJob got the Failed status and the message "TFJob tfjob-manual-ttl-8 has failed because 1 Worker replica(s) failed." but did not get a completion time.
According to the code here, it should set the completion time at the same time.
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/status.go#L192
Oh ... I got it, it was my fault: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/status.go#L192
It should be
if jobStatus.CompletionTime == nil {
	now := metav1.Now()
	jobStatus.CompletionTime = &now
}
that is, jobStatus, not tfJob.Status.
...
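To illustrate why writing to tfJob.Status rather than jobStatus can leave the completion time unset, here is a minimal sketch. It assumes the controller works on a separate copy of the status that is passed in alongside the job object; the types and function names (`TFJob`, `JobStatus`, `markFailedBuggy`, `markFailedFixed`) are hypothetical simplifications, not the real tf-operator code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for the real API types.
type JobStatus struct {
	CompletionTime *time.Time
}

type TFJob struct {
	Status JobStatus
}

// markFailedBuggy sets the completion time on the job object's own status,
// so the separately passed jobStatus never sees it.
func markFailedBuggy(job *TFJob, jobStatus *JobStatus) {
	if job.Status.CompletionTime == nil {
		now := time.Now()
		job.Status.CompletionTime = &now
	}
}

// markFailedFixed sets the completion time on the jobStatus that the rest of
// the reconcile logic (assumed here) reads from.
func markFailedFixed(job *TFJob, jobStatus *JobStatus) {
	if jobStatus.CompletionTime == nil {
		now := time.Now()
		jobStatus.CompletionTime = &now
	}
}

func main() {
	job := &TFJob{}

	working := job.Status // assumption: the controller operates on a copy
	markFailedBuggy(job, &working)
	fmt.Println(working.CompletionTime != nil) // false

	working = JobStatus{}
	markFailedFixed(job, &working)
	fmt.Println(working.CompletionTime != nil) // true
}
```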
@shaowei-su Can you help fix this and test again to see whether it resolves your issue? Thanks
Ah good catch. Sure let me try this out.
Yup it works now after the fix:
status:
  completionTime: "2021-01-26T07:44:16Z"
  conditions:
  - lastTransitionTime: "2021-01-26T07:40:53Z"
    lastUpdateTime: "2021-01-26T07:40:53Z"
    message: TFJob tfjob-manual-ttl-10 is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-26T07:40:55Z"
    lastUpdateTime: "2021-01-26T07:40:55Z"
    message: TFJob bigqueue-jobs-staging/tfjob-manual-ttl-10 successfully completed.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-01-26T07:44:16Z"
    lastUpdateTime: "2021-01-26T07:44:16Z"
    message: TFJob bigqueue-jobs-staging/tfjob-manual-ttl-10 has failed because 1
      Worker replica(s) failed.
    reason: TFJobFailed
    status: "True"
    type: Failed
  replicaStatuses:
    Worker:
      active: 1
      failed: 1
  startTime: "2021-01-26T07:40:53Z"
PTAL @ChanYiLin https://github.com/kubeflow/tf-operator/pull/1226
/reopen
@gaocegege: Reopened this issue.
After upgrading the tf-operator to tag v1.0.1, the controller starts failing with error:
Controller built with: https://github.com/kubeflow/tf-operator/blob/master/build/images/tf_operator/Dockerfile
Git SHA: fb7b1616af823ada13d181849a3d00d9e9e6ac10