kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Operator has invalid memory address error on specific pytorchjob spec #321

Open ca-scribner opened 3 years ago

ca-scribner commented 3 years ago

When running the following YAML,

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: my-pytorchjob
  namespace: my-namespace
spec:
  activeDeadlineSeconds: -1
  cleanPodPolicy: Running
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - args:
            - --backend
            - gloo
            image: pytorch-dist-mnist # (from examples folder)
            name: pytorch
          # imagePullSecrets:
          # - name: image-pull-secret
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - args:
            - --backend
            - gloo
            image: pytorch-dist-mnist # (from examples folder)
            name: pytorch
          # imagePullSecrets:
          # - name: image-pull-secret
  ttlSecondsAfterFinished: -1

I encounter an invalid memory address/nil pointer dereference error that puts the operator into an infinite crash loop:

k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c09c70, 0x3b9aca00, 0x0, 0x1, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc001c09c70, 0x3b9aca00, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).Run
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:202 +0x2c4
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1275e83]
goroutine 210 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/runtime/runtime.go:58 +0x105
panic(0x13f3ea0, 0x2213c70)
/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).cleanupPyTorchJob(0xc000149040, 0xc00028fc80, 0x0, 0x0)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/job.go:194 +0x73
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).reconcilePyTorchJobs(0xc000149040, 0xc00028fc80, 0xc00028fc80, 0xc00014a210)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:434 +0x1265
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).syncPyTorchJob(0xc000149040, 0xc00014a200, 0x39, 0x0, 0x0, 0x0)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:324 +0x4a2
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).processNextWorkItem(0xc000149040, 0x0)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:262 +0x55f
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).runWorker(0xc000149040)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:216 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc001c09c70)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c09c70, 0x3b9aca00, 0x0, 0x1, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc001c09c70, 0x3b9aca00, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).Run
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:202 +0x2c4

As far as I can tell, this only happens when I include all three of ttlSecondsAfterFinished: -1, activeDeadlineSeconds: -1, and cleanPodPolicy: Running. I'm not sure whether the -1 values are valid inputs, but either way I was surprised that they crashed the operator rather than causing the spec to be rejected.
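
For what it's worth, the panic originates in cleanupPyTorchJob (job.go:194), which suggests the TTL/deadline cleanup path dereferences an optional field without a nil check. The sketch below is not the operator's actual code; it uses hypothetical stand-in types (pyTorchJob, jobSpec, jobStatus) to illustrate how dereferencing a nil CompletionTime, combined with an unvalidated negative TTL, can produce exactly this kind of nil pointer panic, and how a guarded version would skip or reject instead of crashing.

package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the operator's job types; the real PyTorchJob
// types live in the kubeflow APIs and use *metav1.Time, *int32, etc.
type jobSpec struct {
	TTLSecondsAfterFinished *int32 // optional; -1 here is what the spec above sets
}

type jobStatus struct {
	CompletionTime *time.Time // nil until the job has actually finished
}

type pyTorchJob struct {
	Spec   jobSpec
	Status jobStatus
}

// cleanupUnsafe shows the crash pattern: it dereferences both optional fields
// unconditionally, so a job that has never completed panics with
// "invalid memory address or nil pointer dereference".
func cleanupUnsafe(job *pyTorchJob) {
	ttl := time.Duration(*job.Spec.TTLSecondsAfterFinished) * time.Second
	expiry := job.Status.CompletionTime.Add(ttl) // panics: CompletionTime is nil
	fmt.Println("cleanup due at", expiry)
}

// cleanupSafe guards every optional field and reports a negative TTL as an
// error instead of crashing the whole controller.
func cleanupSafe(job *pyTorchJob) error {
	ttl := job.Spec.TTLSecondsAfterFinished
	if ttl == nil {
		return nil // no TTL set: nothing to clean up
	}
	if *ttl < 0 {
		return fmt.Errorf("ttlSecondsAfterFinished must be >= 0, got %d", *ttl)
	}
	if job.Status.CompletionTime == nil {
		return nil // job not finished yet: requeue later rather than dereference nil
	}
	expiry := job.Status.CompletionTime.Add(time.Duration(*ttl) * time.Second)
	fmt.Println("cleanup due at", expiry)
	return nil
}

func main() {
	ttl := int32(-1)
	job := &pyTorchJob{Spec: jobSpec{TTLSecondsAfterFinished: &ttl}}

	if err := cleanupSafe(job); err != nil {
		fmt.Println("rejected:", err) // invalid value is reported, not a crash
	}
	cleanupUnsafe(job) // reproduces the nil pointer panic
}

With a guard like cleanupSafe (or admission-time validation of the negative values), the worst case would presumably be a requeue or an error event on the PyTorchJob instead of a crash-looping operator.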

gaocegege commented 3 years ago

I think it is related to https://github.com/kubeflow/tf-operator/issues/1223