project-codeflare / instascale

On-demand Kubernetes/OpenShift cluster scaling and aggregated resource provisioning
Apache License 2.0

Instascale pod crashing with panic after cluster.up() with instascale=False #35

Closed jbusche closed 1 year ago

jbusche commented 1 year ago

I just realized that the Instascale pod is crashing and restarting itself after I issue a cluster.up(). My current cluster config is the following:

cluster = Cluster(ClusterConfiguration(name='jim-mnisttest', min_worker=2, max_worker=2, min_cpus=2, max_cpus=4, min_memory=8, max_memory=8, gpu=0, instascale=False, auth=auth))
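For completeness, the surrounding SDK calls look roughly like this; the TokenAuthentication setup and its token/server values are placeholders I've assumed, since the snippet above only shows auth=auth:

# Minimal sketch of the flow that triggers the crash. The auth setup is assumed
# (placeholder token/server); the ClusterConfiguration values match the snippet above.
from codeflare_sdk.cluster.auth import TokenAuthentication
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

auth = TokenAuthentication(token="sha256~<token>", server="https://api.<cluster>:6443", skip_tls=False)
auth.login()

cluster = Cluster(ClusterConfiguration(name='jim-mnisttest', min_worker=2, max_worker=2,
                                       min_cpus=2, max_cpus=4, min_memory=8, max_memory=8,
                                       gpu=0, instascale=False, auth=auth))

cluster.up()  # submits the AppWrapper; the Instascale pod panics shortly afterwards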

After the cluster.up() is submitted, if I follow the Instascale pod's logs, I see it panic and then restart, like this:

oc logs -f instascale-9dcf85dcf-9cfzc 
I0223 19:48:50.119654       1 request.go:665] Waited for 1.033932486s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
1.6771817321736054e+09  INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": ":8080"}
1.677181732174286e+09   INFO    setup   starting manager
1.6771817321773903e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6771817321773977e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.67718173217752e+09    INFO    controller.appwrapper   Starting EventSource    {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "source": "kind source: *v1beta1.AppWrapper"}
1.6771817321775486e+09  INFO    controller.appwrapper   Starting Controller {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper"}
1.6771817322781909e+09  INFO    controller.appwrapper   Starting workers    {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "worker count": 1}
I0223 19:48:52.281672       1 appwrapper_controller.go:129] Got config map named: instascale-config that configures max nodes in cluster to value 15
I0223 19:48:52.384790       1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status Running
I0223 19:50:06.416947       1 appwrapper_controller.go:420] Appwrapper deleted scale-down machineset: jim-mnisttest 
I0223 19:50:30.335775       1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status 
E0223 19:50:30.335862       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 490 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x15eeba0, 0xc000175770})
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00043e898})
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/runtime/runtime.go:48 +0x75
panic({0x15eeba0, 0xc000175770})
    /usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000501800)
    /workspace/controllers/appwrapper_controller.go:287 +0x285
github.com/project-codeflare/instascale/controllers.onAdd({0x16970c0, 0xc000501800})
    /workspace/controllers/appwrapper_controller.go:226 +0x11e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd({0x1798b20, {0x18e63f8, 0xc000116540}}, {0x16970c0, 0xc000501800})
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/controller.go:264 +0x64
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/shared_informer.go:787 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7fd823cf9fb8)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007cf38, {0x18bfba0, 0xc00016bc50}, 0x1, 0xc0006167e0)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0000a8cc0, 0x3b9aca00, 0x0, 0x57, 0xc00007cf88)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00004c280)
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:71 +0x88
panic: runtime error: index out of range [0] with length 0 [recovered]
    panic: runtime error: index out of range [0] with length 0

goroutine 490 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00043e898})
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x15eeba0, 0xc000175770})
    /usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/project-codeflare/instascale/controllers.discoverInstanceTypes(0xc000501800)
    /workspace/controllers/appwrapper_controller.go:287 +0x285
github.com/project-codeflare/instascale/controllers.onAdd({0x16970c0, 0xc000501800})
    /workspace/controllers/appwrapper_controller.go:226 +0x11e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd({0x1798b20, {0x18e63f8, 0xc000116540}}, {0x16970c0, 0xc000501800})
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/controller.go:264 +0x64
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/shared_informer.go:787 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7fd823cf9fb8)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007cf38, {0x18bfba0, 0xc00016bc50}, 0x1, 0xc0006167e0)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0000a8cc0, 0x3b9aca00, 0x0, 0x57, 0xc00007cf88)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc00004c280)
    /go/pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/shared_informer.go:781 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    /go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:71 +0x88

I'm still investigating and will post results here. I want to see whether this also happens with instascale=True, for example.
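Since the stack trace points at discoverInstanceTypes in appwrapper_controller.go, one thing worth checking is what instance-type information (if any) ends up on the submitted AppWrapper. A rough diagnostic sketch with the Kubernetes Python client follows; the plural name "appwrappers" and the "default" namespace are assumptions for illustration, while the group and version come from the controller logs above:

# Rough diagnostic sketch: fetch the AppWrapper that Instascale's onAdd handler receives
# and print its labels. Group/version come from the controller logs above; the plural
# "appwrappers" and the "default" namespace are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

aw = api.get_namespaced_custom_object(
    group="mcad.ibm.com",
    version="v1beta1",
    namespace="default",
    plural="appwrappers",
    name="jim-mnisttest",
)
print(aw["metadata"].get("labels", {}))  # with instascale=False, no instance-type hint is expected here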

jbusche commented 1 year ago

With:

cluster = Cluster(ClusterConfiguration(name='jim-mnisttest', min_worker=2, max_worker=2, min_cpus=2, max_cpus=4, min_memory=8, max_memory=8, gpu=0, instascale=True, machine_types=["m5.xlarge", "p3.8xlarge"], auth=auth))
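(When re-running with this configuration in the same session, the AppWrapper from the earlier attempt can be removed first; a minimal sketch, assuming the cluster and auth objects from the first snippet are still in scope:)

# Minimal sketch for switching configurations in the same session: tear down the
# previous AppWrapper, then resubmit with instascale=True and explicit machine_types.
# Assumes the `cluster` and `auth` objects from the earlier snippet are still in scope.
cluster.down()

cluster = Cluster(ClusterConfiguration(name='jim-mnisttest', min_worker=2, max_worker=2,
                                       min_cpus=2, max_cpus=4, min_memory=8, max_memory=8,
                                       gpu=0, instascale=True,
                                       machine_types=["m5.xlarge", "p3.8xlarge"], auth=auth))
cluster.up()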

The Instascale pod doesn't restart, and the logs look good:

oc logs -f instascale-9dcf85dcf-dlj4s
I0223 19:58:53.440552       1 request.go:665] Waited for 1.036641942s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v2?timeout=32s
1.6771823354946527e+09  INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": ":8080"}
1.6771823354951782e+09  INFO    setup   starting manager
1.6771823354962528e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6771823354962523e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6771823354964025e+09  INFO    controller.appwrapper   Starting EventSource    {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "source": "kind source: *v1beta1.AppWrapper"}
1.6771823354964495e+09  INFO    controller.appwrapper   Starting Controller {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper"}
1.6771823355975966e+09  INFO    controller.appwrapper   Starting workers    {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "worker count": 1}

I0223 20:01:18.878272       1 appwrapper_controller.go:129] Got config map named: instascale-config that configures max nodes in cluster to value 15
I0223 20:01:18.982414       1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status Pending
I0223 20:01:18.982437       1 appwrapper_controller.go:214] The nodes allowed: 15 and total nodes in cluster after node scale-out 3
I0223 20:01:18.982442       1 appwrapper_controller.go:303] Completed Scaling for jim-mnisttest