kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Error with release tag `v1.0.1` "invalid memory address or nil pointer dereference" #1223

Closed: shaowei-su closed this issue 3 years ago

shaowei-su commented 3 years ago

After upgrading the tf-operator to tag v1.0.1, the controller starts failing with the following error:

E0122 19:05:34.207808       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 212 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1546400, 0x24bd090)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1546400, 0x24bd090)
  /usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc0001b6600, 0xc001494ad8, 0xc00105bce0, 0x3, 0x3, 0xc0022a1260, 0xc00238b5a0, 0x0, 0x0, 0x16f0ee0, ...)
  /go/pkg/mod/github.com/kubeflow/common@v0.3.1/pkg/controller.v1/common/job.go:315 +0xaf
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc0001b6600, 0x16f0ee0, 0xc0014949c0, 0xc0022a1140, 0xc00105bce0, 0x3, 0x3, 0xc0022a1260, 0xc00238b5a0, 0x0, ...)
  /go/pkg/mod/github.com/kubeflow/common@v0.3.1/pkg/controller.v1/common/job.go:113 +0x745
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).syncTFJob(0xc0001b6600, 0xc00222a120, 0x28, 0x0, 0x0, 0x0)
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:326 +0x538
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).processNextWorkItem(0xc0001b6600, 0x0)
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:262 +0x55f
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).runWorker(0xc0001b6600)
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:217 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00256c620)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00256c620, 0x3b9aca00, 0x0, 0x1, 0xc0000ce480)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00256c620, 0x3b9aca00, 0xc0000ce480)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).Run
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:203 +0x2c4
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x13b8a3f]
goroutine 212 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1546400, 0x24bd090)
  /usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc0001b6600, 0xc001494ad8, 0xc00105bce0, 0x3, 0x3, 0xc0022a1260, 0xc00238b5a0, 0x0, 0x0, 0x16f0ee0, ...)
  /go/pkg/mod/github.com/kubeflow/common@v0.3.1/pkg/controller.v1/common/job.go:315 +0xaf
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc0001b6600, 0x16f0ee0, 0xc0014949c0, 0xc0022a1140, 0xc00105bce0, 0x3, 0x3, 0xc0022a1260, 0xc00238b5a0, 0x0, ...)
  /go/pkg/mod/github.com/kubeflow/common@v0.3.1/pkg/controller.v1/common/job.go:113 +0x745
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).syncTFJob(0xc0001b6600, 0xc00222a120, 0x28, 0x0, 0x0, 0x0)
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:326 +0x538
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).processNextWorkItem(0xc0001b6600, 0x0)
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:262 +0x55f
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).runWorker(0xc0001b6600)
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:217 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00256c620)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00256c620, 0x3b9aca00, 0x0, 0x1, 0xc0000ce480)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00256c620, 0x3b9aca00, 0xc0000ce480)
  /go/pkg/mod/k8s.io/apimachinery@v0.16.10-beta.0/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).Run
  /go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:203 +0x2c4

Controller built with https://github.com/kubeflow/tf-operator/blob/master/build/images/tf_operator/Dockerfile (Git SHA: fb7b1616af823ada13d181849a3d00d9e9e6ac10)

gaocegege commented 3 years ago

Thanks for the issue, will fix it soon.

gaocegege commented 3 years ago

The error comes from https://github.com/kubeflow/tf-operator/blob/fb7b1616af823ada13d181849a3d00d9e9e6ac10/vendor/github.com/kubeflow/common/pkg/controller.v1/common/job.go#L315

gaocegege commented 3 years ago

Did you set TTL for the job? Can you show the job status?

gaocegege commented 3 years ago

Can you please add my WeChat gaocedidi?

Maybe we can discuss it.

gaocegege commented 3 years ago

cc @ChanYiLin

ChanYiLin commented 3 years ago

@gaocegege Is there any more information for debugging? I don't think it is because of a missing TTL, since we check whether the TTL is nil. The more likely reason is that the CompletionTime is not set on the jobStatus.

ChanYiLin commented 3 years ago

OK, I might have found the bug (https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L245): when the job exceeds a limit (BackoffLimit or ActiveDeadlineSeconds), it fails and the controller starts cleaning up its resources. The cleanup function uses jobStatus.CompletionTime, but jobStatus.CompletionTime is only set after the cleanup function runs: https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L260

This is a bug in the kubeflow/common repo.
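
To make the failure mode concrete, here is a minimal sketch (the helper name is made up; this is not the actual kubeflow/common CleanupJob) of a TTL computation that dereferences CompletionTime without a nil check, which reproduces the panic above:

package main

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ttlExpiry is a hypothetical stand-in for the TTL computation inside the
// cleanup path: it derives the deletion time from the job's completion time.
func ttlExpiry(ttlSecondsAfterFinished *int32, completionTime *metav1.Time) *time.Time {
    if ttlSecondsAfterFinished == nil {
        return nil // no TTL configured, nothing to clean up
    }
    // If the job reached a terminal state but CompletionTime was never
    // written to the status, this dereference is exactly the
    // "invalid memory address or nil pointer dereference" panic above.
    expiry := completionTime.Add(time.Duration(*ttlSecondsAfterFinished) * time.Second)
    return &expiry
}

func main() {
    ttl := int32(60)
    ttlExpiry(&ttl, nil) // panics: CompletionTime is still nil when cleanup runs
}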

ChanYiLin commented 3 years ago

@shaowei-su Does this happen when some jobs exceed their limit?

shaowei-su commented 3 years ago

Thanks for the quick response! @gaocegege @ChanYiLin

> Did you set TTL for the job? Can you show the job status? Does this happen when some jobs exceed their limit?

This error happens whenever a TFJob fails; its status is:

status:
  conditions:
  - lastTransitionTime: "2021-01-23T06:58:46Z"
    lastUpdateTime: "2021-01-23T06:58:46Z"
    message: TFJob tfjob-manual-ttl-8 is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-23T06:58:49Z"
    lastUpdateTime: "2021-01-23T06:58:49Z"
    message: TFJob tfjob-manual-ttl-8 is running.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-01-23T07:02:07Z"
    lastUpdateTime: "2021-01-23T07:02:07Z"
    message: TFJob tfjob-manual-ttl-8 has failed because 1 Worker replica(s) failed.
    reason: TFJobFailed
    status: "True"
    type: Failed
  replicaStatuses:
    Worker:
      active: 1
      failed: 1
  startTime: "2021-01-23T06:58:46Z"

shaowei-su commented 3 years ago

@gaocegege @ChanYiLin PTAL: https://github.com/kubeflow/common/pull/108 thanks!

gaocegege commented 3 years ago

We need to update the vendored kubeflow/common in tf-operator to pick up the fix. Can you open a PR for it?

gaocegege commented 3 years ago

It is fixed by #1225

ChanYiLin commented 3 years ago

Hi @shaowei-su, I wonder whether the PR in kubeflow/common actually fixes the issue. Your logs show that the controller failed while it was trying to clean up, and the TFJob failed because a worker failed, not because a limit was exceeded. So I guess the cleanup call comes from #L165, not from the limit-exceeded block. Can you check again whether the PR fixes your issue?
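
Rough sketch of what I mean by the two paths (simplified stand-in types and names, not the real ReconcileJobs/CleanupJob code): both the limit-exceeded branch and the finished-job branch reach the cleanup, so a fix that only touches the former still leaves this crash reachable:

package main

import "fmt"

// status mirrors just the field that matters for this discussion.
type status struct {
    completionTimeSet bool
}

// cleanup stands in for CleanupJob: the real code dereferences
// jobStatus.CompletionTime, so a missing completion time means a panic.
func cleanup(s *status) {
    if !s.completionTimeSet {
        fmt.Println("would panic: CompletionTime is nil")
        return
    }
    fmt.Println("TTL expiry computed safely")
}

// reconcile sketches the two call sites that end up in cleanup.
func reconcile(s *status, pastLimit, failed bool) {
    if pastLimit {
        // Limit-exceeded branch (BackoffLimit / ActiveDeadlineSeconds):
        // kubeflow/common PR #108 adds a CompletionTime fallback here.
        cleanup(s)
        return
    }
    if failed {
        // Finished-job branch (around job.go#L165): cleanup runs with
        // whatever status the framework controller (tf-operator) produced,
        // so the completion time must already be set there.
        cleanup(s)
    }
}

func main() {
    // A worker failure, not a limit violation, takes the second branch.
    reconcile(&status{completionTimeSet: false}, false, true)
}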

shaowei-su commented 3 years ago

You are right, @ChanYiLin. I thought that in the Succeeded or Failed case the job status would already have CompletionTime set before being passed in, but that is not actually the case.

So I guess we need a code block like

        if jobStatus.CompletionTime == nil {
            now := metav1.Now()
            jobStatus.CompletionTime = &now
        }

in line #159 as well?

ChanYiLin commented 3 years ago

That would indeed fix the problem, but I am curious why the TFJob got the Failed status and the message "TFJob tfjob-manual-ttl-8 has failed because 1 Worker replica(s) failed." but didn't get a completion time. According to the code here, it should set the completion time at the same time: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/status.go#L192

ChanYiLin commented 3 years ago

Oh ... I got it, it was my fault. https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/status.go#L192 should be

                if jobStatus.CompletionTime == nil {
                    now := metav1.Now()
                    jobStatus.CompletionTime = &now
                }

not tfJob.Status...
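
In other words (a tiny self-contained sketch with made-up stand-in types, not the real TFJob/JobStatus definitions): writing the completion time to tfJob.Status leaves the jobStatus object that the cleanup path reads with a nil CompletionTime.

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical stand-ins for the common job status and the TFJob's own status copy.
type JobStatus struct{ CompletionTime *metav1.Time }
type TFJob struct{ Status JobStatus }

func main() {
    tfJob := &TFJob{}
    jobStatus := &JobStatus{} // the object the cleanup path actually reads

    // Buggy variant: the completion time lands on tfJob.Status ...
    now := metav1.Now()
    tfJob.Status.CompletionTime = &now

    // ... so the status handed to the cleanup still has a nil CompletionTime.
    fmt.Println("jobStatus.CompletionTime == nil:", jobStatus.CompletionTime == nil) // prints true
}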

ChanYiLin commented 3 years ago

@shaowei-su Can you help fix the issue and test again to see whether this fixes your problem? Thanks!

shaowei-su commented 3 years ago

Ah, good catch. Sure, let me try this out.

shaowei-su commented 3 years ago

Yup, it works now after the fix:

status:
  completionTime: "2021-01-26T07:44:16Z"
  conditions:
  - lastTransitionTime: "2021-01-26T07:40:53Z"
    lastUpdateTime: "2021-01-26T07:40:53Z"
    message: TFJob tfjob-manual-ttl-10 is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-26T07:40:55Z"
    lastUpdateTime: "2021-01-26T07:40:55Z"
    message: TFJob bigqueue-jobs-staging/tfjob-manual-ttl-10 successfully completed.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-01-26T07:44:16Z"
    lastUpdateTime: "2021-01-26T07:44:16Z"
    message: TFJob bigqueue-jobs-staging/tfjob-manual-ttl-10 has failed because 1
      Worker replica(s) failed.
    reason: TFJobFailed
    status: "True"
    type: Failed
  replicaStatuses:
    Worker:
      active: 1
      failed: 1
  startTime: "2021-01-26T07:40:53Z"

PTAL @ChanYiLin https://github.com/kubeflow/tf-operator/pull/1226

gaocegege commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@gaocegege: Reopened this issue.

In response to [this](https://github.com/kubeflow/tf-operator/issues/1223#issuecomment-767382444):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.