Closed jbusche closed 1 year ago
With:
cluster = Cluster(ClusterConfiguration(
    name='jim-mnisttest',
    min_worker=2,
    max_worker=2,
    min_cpus=2,
    max_cpus=4,
    min_memory=8,
    max_memory=8,
    gpu=0,
    instascale=True,
    machine_types=["m5.xlarge", "p3.8xlarge"],
    auth=auth,
))
The InstaScale pod doesn't restart; the logs look good:
oc logs -f instascale-9dcf85dcf-dlj4s
I0223 19:58:53.440552 1 request.go:665] Waited for 1.036641942s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v2?timeout=32s
1.6771823354946527e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.6771823354951782e+09 INFO setup starting manager
1.6771823354962528e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6771823354962523e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6771823354964025e+09 INFO controller.appwrapper Starting EventSource {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "source": "kind source: *v1beta1.AppWrapper"}
1.6771823354964495e+09 INFO controller.appwrapper Starting Controller {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper"}
1.6771823355975966e+09 INFO controller.appwrapper Starting workers {"reconciler group": "mcad.ibm.com", "reconciler kind": "AppWrapper", "worker count": 1}
I0223 20:01:18.878272 1 appwrapper_controller.go:129] Got config map named: instascale-config that configures max nodes in cluster to value 15
I0223 20:01:18.982414 1 appwrapper_controller.go:223] Found Appwrapper named jim-mnisttest that has status Pending
I0223 20:01:18.982437 1 appwrapper_controller.go:214] The nodes allowed: 15 and total nodes in cluster after node scale-out 3
I0223 20:01:18.982442 1 appwrapper_controller.go:303] Completed Scaling for jim-mnisttest
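The zap-formatted lines above print timestamps as epoch seconds in scientific notation (e.g. 1.6771823354946527e+09), which is hard to correlate with the klog-style I0223 19:58:53 lines. A small helper (assuming the standard controller-runtime/zap epoch-seconds format) converts them to readable UTC times:

```python
from datetime import datetime, timezone

def zap_epoch_to_utc(ts: float) -> str:
    """Convert a zap epoch-seconds timestamp to a human-readable UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# The first controller-runtime line above:
print(zap_epoch_to_utc(1.6771823354946527e+09))  # → 2023-02-23 19:58:55
```

This confirms the zap lines line up with the klog lines a couple of seconds earlier in the same pod start.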
I just realized that the InstaScale pod is crashing and restarting after I issue cluster.up(). My current cluster config is the following:
After cluster.up() is submitted, if I follow the InstaScale pod's logs, I see it panic and then restart, like this:
I'm still investigating and will post results here. I want to see whether this happens with instascale=True, for example.
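For anyone reproducing this, one way to catch the panic is to watch the pod for restarts while cluster.up() runs, then pull the logs of the previous (crashed) container instance. A sketch, using the pod name from the logs above (your generated pod name and namespace will differ):

```shell
# Watch for the InstaScale pod cycling through CrashLoopBackOff / restarts
oc get pods -w | grep instascale

# After a restart, the panic is in the *previous* container's logs,
# not the current one's:
oc logs instascale-9dcf85dcf-dlj4s --previous
```

`oc logs --previous` is the key part: once the pod restarts, a plain `oc logs -f` only shows the fresh container, so the panic trace is lost unless you ask for the prior instance.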