kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.

Encounter NIL Error when job in error stage with TTL value set #170

Open mirocody opened 2 years ago

mirocody commented 2 years ago

Hi community, I am trying to deploy a simple job using a PyTorchJob with the following YAML:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob
  namespace: abc
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch

  runPolicy:
    ttlSecondsAfterFinished: 864000

The script exception.py does nothing but raise an exception, so the container ends up in an error state. The training operator pod then logs the following:

E1026 03:50:23.343541       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 560 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16da180, 0x27a0b00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:48 +0x82
panic(0x16da180, 0x27a0b00)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc000e89320, 0xc000703618, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, 0x0, 0x18987c0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:401 +0xbd
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000e89320, 0x18987c0, 0xc000703500, 0xc0008183c0, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:147 +0x76d
github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).Reconcile(0xc000e89320, 0x1b88fa0, 0xc000818270, 0xc000624f60, 0x13, 0xc000a1b590, 0x28, 0xc000818270, 0x40903b, 0xc000030000, ...)
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:159 +0x83c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x1750a40, 0xc000348340)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x0)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1b88ee0, 0xc000d26400)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00026c750)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001121f50, 0x1b46440, 0xc000818180, 0xc000d26401, 0xc000a36240)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00026c750, 0x3b9aca00, 0x0, 0x1, 0xc000a36240)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00, 0x0, 0x1986d01)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x14a257d]

It looks like this line assumes the completion time is always set, but for a job that ends in an error state it is still nil when the cleanup starts, which leads to the nil pointer dereference.
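
For illustration, here is a minimal sketch (not the upstream kubeflow/common code) of the pattern that would explain the panic, assuming jobStatus.CompletionTime is a *metav1.Time that is left nil for a failed job. The cleanupExpired helper is hypothetical; it only shows the nil guard that appears to be missing before the TTL expiry is computed.

// Minimal sketch of the suspected failure mode: dereferencing a nil
// CompletionTime while computing the TTL expiry, plus the guard that avoids it.
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// jobStatus stands in for the controller's job status type; only the field
// relevant to the panic is included.
type jobStatus struct {
	CompletionTime *metav1.Time
}

// cleanupExpired reports whether the job has been finished for longer than
// ttlSeconds. The nil check is the guard that seems to be missing at
// job.go:401: a job that ended in an error state may never get
// CompletionTime set, so we return an error instead of panicking.
func cleanupExpired(status jobStatus, ttlSeconds int32) (bool, error) {
	if status.CompletionTime == nil {
		return false, fmt.Errorf("cleanup requested but CompletionTime is nil")
	}
	expire := status.CompletionTime.Add(time.Duration(ttlSeconds) * time.Second)
	return time.Now().After(expire), nil
}

func main() {
	// Failed job: CompletionTime was never populated.
	failed := jobStatus{CompletionTime: nil}
	if _, err := cleanupExpired(failed, 864000); err != nil {
		fmt.Println("handled instead of panicking:", err)
	}
}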