ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Bug] Readiness probe failed: timeout on minikube #2158

Open anovv opened 5 months ago

anovv commented 5 months ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

After the RayCluster is launched (after about 40s), the operator kills all worker pods due to a failed readiness probe. Nothing is restarted; only the head node stays (it passes the probe fine). Events:

Readiness probe failed: command "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out

This is repeated for each worker pod, which is then killed. The head node stays healthy; the workers are not restarted.
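To double-check the probe itself, the same command can be run by hand inside a worker pod before it gets terminated (the pod name below is just a placeholder); if the local raylet health endpoint on port 52365 is reachable, it should print success well within the 2s timeout:

kubectl get pods
kubectl exec -it <worker-pod-name> -- bash -c "wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success"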

Reproduction script

  1. Launch minikube cluster
  2. Install kuberay-operator via helm
  3. Install RayCluster via helm
  4. Wait ~40s and watch all worker pods get terminated by the raycluster-controller (roughly the commands shown below)
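Roughly what I run (the Helm repo URL is the standard KubeRay one; the chart versions and image tag here are examples matching what I describe elsewhere in this issue):

minikube start
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
kubectl get pods --watch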

Anything else

I'm running minikube inside Colima on an M2 Mac. I tried different arm64 versions of the KubeRay operator (1.1.0 and 1.1.1) with the same problem.

Are you willing to submit a PR?

kevin85421 commented 5 months ago

Which Ray images are you using? You should use images that include aarch64 in the image tag.

anovv commented 5 months ago

@kevin85421 yes, I'm using aarch64 images (2.22.0-py310-aarch64 for Ray, to be exact).

anovv commented 5 months ago

@kevin85421 do you have any idea what may be happening? This blocks me.

kevin85421 commented 5 months ago

I tried the following on my Mac M1, and my RayCluster is healthy; no pods have been killed.

kind create cluster
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64
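After that, watching the pods and recent events for a minute or two is a quick way to confirm whether the readiness probes pass (plain kubectl, nothing chart-specific):

kubectl get pods -w
kubectl get events --sort-by=.lastTimestamp | grep -i readiness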

Btw, are you in the Ray Slack workspace? It would be helpful to join; other KubeRay users can share their experiences there. You can join the #kuberay-questions channel.

anovv commented 5 months ago

@kevin85421 what container runtime do you use? Colima or Docker Desktop?

kevin85421 commented 5 months ago

I use Docker.

anovv commented 5 months ago

Ok @kevin85421, I think I found the culprit: some weird behaviour with the worker.minReplicas parameter when autoscaling is enabled (head.enableInTreeAutoscaling: true).

Example cases:

I also noticed that not setting worker.maxReplicas leads to weird behaviour as well (the number of pods does not match the request), and the head node throws an error about the autoscaler not working properly.
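To illustrate which parameters I mean, this is the kind of install I'm describing (the replica numbers here are placeholders rather than my exact values, and worker.replicas is, as far as I can tell, the chart's worker-count field; the other fields are the ones named above):

helm install raycluster kuberay/ray-cluster --version 1.1.1 \
  --set image.tag=2.22.0-py310-aarch64 \
  --set head.enableInTreeAutoscaling=true \
  --set worker.replicas=2 \
  --set worker.minReplicas=0 \
  --set worker.maxReplicas=4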

So I see two possible things here (which may be interconnected):

Disabling enableInTreeAutoscaling makes everything work as expected.

What do you think?