[Bug] Readiness probe failed: timeout on minikube

anovv commented 5 months ago

Search before asking

[X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

After RayCluster is launched (about 40s) operator kills all worker pods due to failed readiness probe, nothing is restarted, only head node stays (which passes the probe okay). Events:

Readiness probe failed: command "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out

Repeated for each worker pod which are then killed. Head node stays healthy, workers are not restarted.

Reproduction script

Launch minikube cluster
Install kuberay-operator via helm
Install RayCluster via helm
Wait 40s, watch all worker pods are terminated by raycluster-controller

Anything else

I'm running minikube inside colima on m2 mac. Tried different arm64 versions of kuberay operator (1.1.0 and 1.1.1) - same problem.

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

kevin85421 commented 5 months ago

Which Ray images are you using? You should use images that include aarch64 in the image tag.

anovv commented 5 months ago

@kevin85421 yes, I'm using aarch64 images, 2.22.0-py310-aarch64 for Ray to be exact

anovv commented 5 months ago

@kevin85421 do you have any idea what may be happening? This blocks me.

kevin85421 commented 5 months ago

I tried the following on my Mac M1, and my RayCluster is healthy; no pods have been killed.

kind create cluster
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
helm install raycluster kuberay/ray-cluster --version 1.1.1 --set image.tag=2.22.0-py310-aarch64

We may have some differences: (1) kind vs minikube (2) M1 vs M2 (3) different instructions.
- You can try kind to determine whether the question is minikube-only or not.
- Use the exact the same instructions above in your environment.

Btw, are you in the Ray Slack channel? It will be helpful to join the Slack workspace. Other KubeRay users can also share their experiences. You can join #kuberay-questions channel.

anovv commented 5 months ago

@kevin85421 what container runtime do you use? Colima or Docker Desktop?

kevin85421 commented 5 months ago

I use Docker.

anovv commented 5 months ago

Ok @kevin85421, I think I found the culprit, some weird behaviour with worker.minReplicas parameters with enabled autoscaling head.enableInTreeAutoscaling: true

Example cases:

```
worker:
  replicas: 4
  minReplicas: 0
  maxReplicas: 1000 
```
I get 4 pods launched, then (about 60s) all 4 failing readiness probe and getting killed
```
worker:
  replicas: 4
  minReplicas: 2
  maxReplicas: 1000 
```
I get 4 pods launched, then (about 60s) 2 fail readiness probe and die, 2 stay healthy and work
If I set no min
```
worker:
  replicas: 4
  maxReplicas: 1000 
```
I get 4 pods launched, then (about 60s) 3 failing readiness probe and getting killed, 1 stays healthy and works
If I set worker.replicas = worker.minReplicas = 4, I get all 4 working properly.

Also noticed not setting worker.maxReplicas leads to a weird behaviour as well (number of pods does not match the request) and head node throws error with autoscaler not working properly

So I see two possible things here (which may be interconnected):

KubeRay uses worker.minReplicas as default when autoscaler is on after recovering from readiness probe fail (which is unexpected as it should use worker.replicas value)?
readiness probes fail only on pods not tracked by autoscaler (not sure why)?

Disabling enableInTreeAutoscaling makes everything work as expected.

What do you think?

ray-project / kuberay