anovv opened 5 months ago
Which Ray images are you using? You should use images that include aarch64
in the image tag.
@kevin85421 yes, I'm using aarch64
images, 2.22.0-py310-aarch64
for Ray to be exact
@kevin85421 do you have any idea what may be happening? This is blocking me.
I tried the following on my Mac M1, and my RayCluster is healthy; no pods have been killed.
kind create cluster
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
Would you mind trying kind to determine whether the issue is minikube-only or not? Btw, are you in the Ray Slack workspace? It would be helpful to join; other KubeRay users can also share their experiences there. You can join the #kuberay-questions channel.
@kevin85421 what container runtime do you use? Colima or Docker Desktop?
I use Docker.
Ok @kevin85421, I think I found the culprit: some weird behaviour with the worker.minReplicas parameter when autoscaling is enabled (head.enableInTreeAutoscaling: true).
Example cases:
worker:
  replicas: 4
  minReplicas: 0
  maxReplicas: 1000
I get 4 pods launched, then (after about 60s) all 4 fail the readiness probe and get killed.
worker:
  replicas: 4
  minReplicas: 2
  maxReplicas: 1000
I get 4 pods launched, then (after about 60s) 2 fail the readiness probe and die, while 2 stay healthy and work.
If I set no minReplicas:
worker:
  replicas: 4
  maxReplicas: 1000
I get 4 pods launched, then (after about 60s) 3 fail the readiness probe and get killed, and 1 stays healthy and works.
If I set worker.replicas = worker.minReplicas = 4, I get all 4 working properly.
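For reference, a minimal sketch of the working values, assuming the same kuberay ray-cluster Helm chart layout as in the cases above:

```yaml
# Working configuration sketch: the autoscaler floor (minReplicas)
# matches the requested worker count (replicas), so no workers get
# scaled away after the readiness probe failures.
worker:
  replicas: 4
  minReplicas: 4
  maxReplicas: 1000
head:
  enableInTreeAutoscaling: true
```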
I also noticed that not setting worker.maxReplicas leads to weird behaviour as well (the number of pods does not match the request), and the head node throws errors about the autoscaler not working properly.
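This is the shape of values I mean (illustrative only; I have not re-checked the other fields from that run):

```yaml
# Illustrative sketch: worker.maxReplicas deliberately left unset.
# In this case the pod count did not match the request and the
# head node logged autoscaler errors.
worker:
  replicas: 4
  # maxReplicas: (not set)
```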
So I see two possible things here (which may be interconnected): the workers failing the readiness probe in the first place, and the operator/autoscaler seemingly falling back to worker.minReplicas after the readiness probe failure (which is unexpected, as it should use the worker.replicas value). Disabling enableInTreeAutoscaling makes everything work as expected.
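As a workaround sketch (again assuming the same chart layout), this is what currently behaves as expected for me:

```yaml
# Workaround sketch: in-tree autoscaler disabled, fixed worker count.
head:
  enableInTreeAutoscaling: false
worker:
  replicas: 4
  maxReplicas: 1000
```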
What do you think?
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
After the RayCluster is launched (after about 40s), the operator kills all worker pods due to a failed readiness probe; nothing is restarted, and only the head node stays (which passes the probe fine). Events:
Readiness probe failed: command "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out
This is repeated for each worker pod, which are then killed. The head node stays healthy, and the workers are not restarted.
Reproduction script
Anything else
I'm running minikube inside Colima on an M2 Mac. I tried different arm64 versions of the KubeRay operator (1.1.0 and 1.1.1), with the same problem.
Are you willing to submit a PR?