kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Panic when controller restarts #147

Closed Jeffwan closed 3 years ago

Jeffwan commented 3 years ago

I am working on all-in-one operator development. The controller panics when I rebuild and restart it. It looks like a problem with how the default RunPolicy is set. It is quite possibly a problem in my dev branch; I am creating this issue just in case.

https://github.com/kubeflow/common/blob/f162091f3ea6b2275635d48116dd67c1b344ef61/pkg/controller.v1/common/job.go#L27

INFO[0002] Reconciling for job xgboost-dist-iris-test-train
E0730 22:06:10.553278   17780 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 324 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2248160, 0x326c4e0)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:48 +0x82
panic(0x2248160, 0x326c4e0)
    /usr/local/opt/go@1.14/libexec/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).DeletePodsAndServices(0xc0001d5320, 0xc000603f18, 0x23e3d40, 0xc000603e00, 0xc000d13620, 0x3, 0x3, 0x241ebb1, 0x9)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:27 +0x3c
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc0001d5320, 0x23e3d40, 0xc000603e00, 0xc000d18b70, 0xc00013a2a0, 0x2, 0x2, 0xc000d18ba0, 0x0, 0xc000d13340, ...)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:161 +0x68f
github.com/kubeflow/tf-operator/pkg/controller.v1/xgboost.(*XGBoostJobReconciler).Reconcile(0xc0001d5320, 0x2689e40, 0xc000d18ab0, 0xc0008a9ed9, 0x7, 0xc0007a8620, 0x1c, 0xc000d18ab0, 0x100affb, 0xc000032000, ...)
    /Users/jiaxin/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/xgboost/xgboostjob_controller.go:169 +0x805
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000834000, 0x2689d80, 0xc00090c5c0, 0x22a5d00, 0xc000b4eb60)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000834000, 0x2689d80, 0xc00090c5c0, 0x0)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x2689d80, 0xc00090c5c0)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004d2750)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000d3ff50, 0x264d5c0, 0xc000b68630, 0xc00090c501, 0xc0005e5080)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004d2750, 0x3b9aca00, 0x0, 0x1, 0xc0005e5080)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x2689d80, 0xc00090c5c0, 0xc0006be820, 0x3b9aca00, 0x0, 0x1)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x2689d80, 0xc00090c5c0, 0xc0006be820, 0x3b9aca00)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x20c676c]

goroutine 324 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:55 +0x105
panic(0x2248160, 0x326c4e0)
    /usr/local/opt/go@1.14/libexec/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).DeletePodsAndServices(0xc0001d5320, 0xc000603f18, 0x23e3d40, 0xc000603e00, 0xc000d13620, 0x3, 0x3, 0x241ebb1, 0x9)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:27 +0x3c
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc0001d5320, 0x23e3d40, 0xc000603e00, 0xc000d18b70, 0xc00013a2a0, 0x2, 0x2, 0xc000d18ba0, 0x0, 0xc000d13340, ...)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:161 +0x68f
github.com/kubeflow/tf-operator/pkg/controller.v1/xgboost.(*XGBoostJobReconciler).Reconcile(0xc0001d5320, 0x2689e40, 0xc000d18ab0, 0xc0008a9ed9, 0x7, 0xc0007a8620, 0x1c, 0xc000d18ab0, 0x100affb, 0xc000032000, ...)
    /Users/jiaxin/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/xgboost/xgboostjob_controller.go:169 +0x805
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000834000, 0x2689d80, 0xc00090c5c0, 0x22a5d00, 0xc000b4eb60)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000834000, 0x2689d80, 0xc00090c5c0, 0x0)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x2689d80, 0xc00090c5c0)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004d2750)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000d3ff50, 0x264d5c0, 0xc000b68630, 0xc00090c501, 0xc0005e5080)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004d2750, 0x3b9aca00, 0x0, 0x1, 0xc0005e5080)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x2689d80, 0xc00090c5c0, 0xc0006be820, 0x3b9aca00, 0x0, 0x1)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x2689d80, 0xc00090c5c0, 0xc0006be820, 0x3b9aca00)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6

Jeffwan commented 3 years ago

INFO[0035] Reconciling for job cleanpod-policy-tests-v1
E0803 15:09:22.068987   83123 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 391 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2251700, 0x3277540)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:48 +0x82
panic(0x2251700, 0x3277540)
    /usr/local/opt/go@1.14/libexec/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).DeletePodsAndServices(0xc00029aea0, 0xc000406938, 0x23ed160, 0xc000406820, 0xc000337c00, 0x7, 0x8, 0x24282a9, 0x9)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:27 +0x3c
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc00029aea0, 0x23ed160, 0xc000406820, 0xc000972c00, 0xc000ca00e0, 0x2, 0x2, 0xc000972c60, 0xc000adc920, 0xc000adc940, ...)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:161 +0x68f
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFJobReconciler).Reconcile(0xc00029aea0, 0x2693a80, 0xc000972ae0, 0xc000d61520, 0x8, 0xc000d5cb60, 0x18, 0xc000972ae0, 0x100b05b, 0xc000032000, ...)
    /Users/jiaxin/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/tfjob_controller.go:153 +0x834
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00094e280, 0x26939c0, 0xc00002a000, 0x22af160, 0xc000adc760)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00094e280, 0x26939c0, 0xc00002a000, 0x0)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x26939c0, 0xc00002a000)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000548750)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001097f50, 0x2657000, 0xc000014fc0, 0xc00002a001, 0xc000524300)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000548750, 0x3b9aca00, 0x0, 0x1, 0xc000524300)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x26939c0, 0xc00002a000, 0xc00088e5f0, 0x3b9aca00, 0x0, 0x24daf01)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x26939c0, 0xc00002a000, 0xc00088e5f0, 0x3b9aca00)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x20cfb1c]

goroutine 391 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:55 +0x105
panic(0x2251700, 0x3277540)
    /usr/local/opt/go@1.14/libexec/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).DeletePodsAndServices(0xc00029aea0, 0xc000406938, 0x23ed160, 0xc000406820, 0xc000337c00, 0x7, 0x8, 0x24282a9, 0x9)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:27 +0x3c
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc00029aea0, 0x23ed160, 0xc000406820, 0xc000972c00, 0xc000ca00e0, 0x2, 0x2, 0xc000972c60, 0xc000adc920, 0xc000adc940, ...)
    /Users/jiaxin/go/pkg/mod/github.com/kubeflow/common@v0.3.4/pkg/controller.v1/common/job.go:161 +0x68f
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFJobReconciler).Reconcile(0xc00029aea0, 0x2693a80, 0xc000972ae0, 0xc000d61520, 0x8, 0xc000d5cb60, 0x18, 0xc000972ae0, 0x100b05b, 0xc000032000, ...)
    /Users/jiaxin/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/tfjob_controller.go:153 +0x834
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00094e280, 0x26939c0, 0xc00002a000, 0x22af160, 0xc000adc760)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00094e280, 0x26939c0, 0xc00002a000, 0x0)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x26939c0, 0xc00002a000)
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000548750)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001097f50, 0x2657000, 0xc000014fc0, 0xc00002a001, 0xc000524300)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000548750, 0x3b9aca00, 0x0, 0x1, 0xc000524300)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x26939c0, 0xc00002a000, 0xc00088e5f0, 0x3b9aca00, 0x0, 0x24daf01)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x26939c0, 0xc00002a000, 0xc00088e5f0, 0x3b9aca00)
    /Users/jiaxin/go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6

Jeffwan commented 3 years ago

This issue has been fixed in https://github.com/kubeflow/tf-operator/pull/1360.

It's not a problem in kubeflow/common. The panic happens here because the code always assumes that runPolicy has a value.
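
For anyone hitting the same trace, here is a minimal sketch of the failure mode and two ways to avoid it. It assumes the panic comes from dereferencing a pointer field of the run policy that was never defaulted. The types and the helpers `setDefaultRunPolicy` and `shouldDeletePods` are hypothetical stand-ins written for illustration. They are not the kubeflow/common API and not the code from the fix.

```go
package main

import "fmt"

// CleanPodPolicy and RunPolicy are simplified, hypothetical stand-ins for the
// real API types; only the field relevant to this issue is modeled.
type CleanPodPolicy string

const (
	CleanPodPolicyNone    CleanPodPolicy = "None"
	CleanPodPolicyRunning CleanPodPolicy = "Running"
)

type RunPolicy struct {
	// Pointer field: nil when the job spec omits it and no defaulting has run.
	CleanPodPolicy *CleanPodPolicy
}

// setDefaultRunPolicy is a hypothetical defaulting helper that an operator
// could call before handing the policy to shared reconcile code.
func setDefaultRunPolicy(rp *RunPolicy) {
	if rp.CleanPodPolicy == nil {
		def := CleanPodPolicyRunning // assumed default, for illustration only
		rp.CleanPodPolicy = &def
	}
}

// shouldDeletePods performs the kind of check that panics on a nil pointer,
// but returns an error instead of dereferencing blindly.
func shouldDeletePods(rp *RunPolicy) (bool, error) {
	if rp == nil || rp.CleanPodPolicy == nil {
		return false, fmt.Errorf("runPolicy.cleanPodPolicy is not set")
	}
	return *rp.CleanPodPolicy != CleanPodPolicyNone, nil
}

func main() {
	// Simulates a job whose run policy was never defaulted.
	rp := &RunPolicy{}
	if _, err := shouldDeletePods(rp); err != nil {
		fmt.Println("before defaulting:", err)
	}

	setDefaultRunPolicy(rp)
	del, _ := shouldDeletePods(rp)
	fmt.Println("after defaulting, delete pods:", del)
}
```

Defaulting on the operator side keeps the shared library unchanged, while the nil guard turns a crash into an error that can be logged and retried. Which of the two fits better depends on where the defaults are supposed to be owned.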