ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Readiness and liveness probes failing when applying ray-service.sample.yaml file #2269

Open YASHY2K opened 4 months ago

YASHY2K commented 4 months ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When following the setup tutorial, Step 3 points to an incorrect 'ray-service.sample.yaml'. When applying that file, the worker node crashes and the logs suggest that the readiness/liveness probe failed. The expected behaviour is as follows:

rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5   1/1   Running   0   3m52s
rayservice-sample-raycluster-6mj28-head-x77h4                 1/1   Running   0   3m52s

But in reality:

rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5   0/1   Running   0
rayservice-sample-raycluster-6mj28-head-x77h4                 1/1   Running   0
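A quick way to see why the worker pod is stuck at 0/1 is to inspect its events and the probes that were injected into the container. A minimal sketch, using the pod name from the output above (substitute your own):

    # Show recent events, including the readiness/liveness probe failures
    kubectl describe pod rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5

    # Print the probes injected into the worker container
    kubectl get pod rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5 \
      -o jsonpath='{.spec.containers[0].readinessProbe}{"\n"}{.spec.containers[0].livenessProbe}{"\n"}'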

Reproduction script

Followed this tutorial and applied the linked YAML file.

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 4 months ago

You can check Step 9 for more details https://docs.ray.io/en/master/cluster/kubernetes/user-guides/rayservice.html#step-9-why-1-worker-pod-isnt-ready.
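To confirm the situation matches what Step 9 describes (a not-ready worker being expected behavior rather than a failure), a minimal check, assuming the default sample names and KubeRay labels:

    # Inspect the RayService and its Serve application statuses
    kubectl get rayservice rayservice-sample
    kubectl describe rayservice rayservice-sample

    # Watch the Ray pods; the worker should report 1/1 once the Serve health check on it passes
    kubectl get pods -l ray.io/cluster --watch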

kevin85421 commented 4 months ago

If you are interested in contributing to Ray or KubeRay, you can open a PR that adds this as a new issue in Ray's documentation and links to Step 9 in that section.

kevin85421 commented 4 months ago

You can ping me for review.

YASHY2K commented 3 months ago

Should I raise a PR on the repo? The only change I have made is to the RayService sample YAML file; it shouldn't break anything.

kevin85421 commented 3 months ago

Instead of updating the YAML, I would prefer to update Step 4 to explain why the readiness probe failure is expected behavior.
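For context, the kind of change being discussed lives in the worker group's pod template of ray-service.sample.yaml. A rough sketch of an explicit readiness probe override there, with illustrative thresholds and an assumed health-check command (verify against the probe KubeRay actually injects for your version):

    workerGroupSpecs:
      - groupName: small-group
        template:
          spec:
            containers:
              - name: ray-worker
                readinessProbe:
                  exec:
                    command:
                      - bash
                      - -c
                      # Assumed default dashboard-agent health endpoint; confirm for your setup
                      - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
                  initialDelaySeconds: 10
                  periodSeconds: 5
                  failureThreshold: 10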

YASHY2K commented 3 months ago

I have raised a PR for RayService Troubleshooting. Can you please check it?

frivas-at-navteca commented 6 days ago

Hello, I have seen this error while following the steps in Deploy on Kubernetes:

  Normal   Created    2m26s                 kubelet            Created container ray-worker
  Normal   Started    2m26s                 kubelet            Started container ray-worker
  Warning  Unhealthy  48s (x19 over 2m13s)  kubelet            Readiness probe failed: success

I have tried the suggested solution, but it seems to work only for that use case.
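To see what is behind the confusing "Readiness probe failed: success" event, it helps to print the probe command and run its checks by hand inside the worker container. The ports and endpoints below are assumptions about the default KubeRay probes (dashboard agent on 52365, Serve proxy on 8000), so verify them against your pod spec:

    # Print the exec command the readiness probe actually runs
    kubectl -n kuberay get pod <worker-pod> \
      -o jsonpath='{.spec.containers[0].readinessProbe.exec.command}{"\n"}'

    # Run the health checks manually inside the worker container
    kubectl -n kuberay exec <worker-pod> -c ray-worker -- \
      wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz
    kubectl -n kuberay exec <worker-pod> -c ray-worker -- \
      wget -T 2 -q -O- http://localhost:8000/-/healthz

If the first check prints "success" while the second fails, the probe as a whole fails even though its captured output is just "success", which would explain the event message above.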

I am still figuring out what is going on. The logs show me this:

$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n -c ray-worker
error: error from server (NotFound): pods "rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n" not found in namespace "kuberay"
ubuntu@ip-172-31-14-240:~$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-rjnsd -c ray-worker
[2024-11-19 03:55:22,190 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2024-11-19 03:55:23,193 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2024-11-19 03:55:22,088 INFO scripts.py:926 -- Local node IP: 10.1.34.83
2024-11-19 03:55:24,199 SUCC scripts.py:939 -- --------------------
2024-11-19 03:55:24,199 SUCC scripts.py:940 -- Ray runtime started.
2024-11-19 03:55:24,199 SUCC scripts.py:941 -- --------------------
2024-11-19 03:55:24,199 INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-11-19 03:55:24,199 INFO scripts.py:944 --   ray stop
2024-11-19 03:55:24,199 INFO scripts.py:952 -- --block
2024-11-19 03:55:24,199 INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-11-19 03:55:24,199 INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

I have done some basic searching on the message "global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?", but I have not found whether it might be the cause of the error, and I am not sure whether it is only a warning.
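To rule the GCS warning in or out, two quick checks are whether the head actually sees the worker node and whether the worker can reach the GCS port on the head service. The service name below follows KubeRay's <raycluster-name>-head-svc convention and 6379 is Ray's default GCS port; adjust if yours differ:

    # On the head: list the nodes registered with GCS
    kubectl -n kuberay exec <head-pod> -c ray-head -- ray status

    # From the worker: confirm the head service resolves and the GCS port accepts connections
    kubectl -n kuberay exec <worker-pod> -c ray-worker -- python -c \
      "import socket; socket.create_connection(('rayservice-sample-raycluster-dq4cs-head-svc', 6379), timeout=5); print('GCS reachable')"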

I will be updating this with the progress I make.

Update #1 [13:27 11/19/2024]: It seems that tests of the samples are being skipped (#2475).

Update #2 [13:09 11/22/2024]: Increased the resources (CPU and memory); it doesn't seem to show any improvement. However, when I test the application it works perfectly. One more thing I tested was using Bitnami's image, and I got a different error in the logs. As it is a warning and everything seems to work correctly, I am giving other more pressing matters priority and will get back to this.
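For reference, the resource bump mentioned in Update #2 is made in the worker group's pod template; a rough sketch with illustrative values (not the sample's defaults):

    workerGroupSpecs:
      - groupName: small-group
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  requests:
                    cpu: "1"
                    memory: 2Gi
                  limits:
                    cpu: "2"
                    memory: 4Gi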