Open YASHY2K opened 4 months ago
You can check Step 9 for more details: https://docs.ray.io/en/master/cluster/kubernetes/user-guides/rayservice.html#step-9-why-1-worker-pod-isnt-ready.
If you are interested in contributing to Ray or KubeRay, you can open a PR to add a new issue in Ray's documentation and then add the link to step 9 in that section.
You can ping me for review.
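For reference, a quick way to confirm which worker Pod isn't ready and why (a sketch; the namespace and Pod name are placeholders, and the label selector assumes the labels recent KubeRay versions put on worker Pods):
# List the worker Pods; per Step 9, one of them may show READY 0/1.
kubectl get pods -n <namespace> -l ray.io/node-type=worker
# Describe the not-ready worker and look at the Events section for the
# "Readiness probe failed" message that Step 9 explains.
kubectl describe pod <worker-pod-name> -n <namespace>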
Should I raise a PR on the repo? The only change I have made is to the RayService sample YAML file; it shouldn't break anything.
Instead of updating the YAML, I may prefer to update Step 4 to explain why the readiness probe failure is an expected behavior.
I have raised a PR for RayService Troubleshooting. Can you please check it?
Hello, I have seen this error while following the steps in the Deploy on Kubernetes guide.
Normal Created 2m26s kubelet Created container ray-worker
Normal Started 2m26s kubelet Started container ray-worker
Warning Unhealthy 48s (x19 over 2m13s) kubelet Readiness probe failed: success
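To see exactly which check is failing, the readiness probe that KubeRay injects into the worker container can be dumped from the Pod spec (just a sketch; I'm assuming the container is named ray-worker, as in the events above):
# Print the readiness probe of the ray-worker container to see the exact exec command the kubelet runs.
kubectl get pod <worker-pod-name> -n kuberay -o jsonpath='{.spec.containers[?(@.name=="ray-worker")].readinessProbe}'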
I have tried the suggested solution, but it seems to work only for that use case.
I am still figuring out what is going on. The logs show me this:
$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n -c ray-worker
error: error from server (NotFound): pods "rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n" not found in namespace "kuberay"
ubuntu@ip-172-31-14-240:~$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-rjnsd -c ray-worker
[2024-11-19 03:55:22,190 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2024-11-19 03:55:23,193 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2024-11-19 03:55:22,088 INFO scripts.py:926 -- Local node IP: 10.1.34.83
2024-11-19 03:55:24,199 SUCC scripts.py:939 -- --------------------
2024-11-19 03:55:24,199 SUCC scripts.py:940 -- Ray runtime started.
2024-11-19 03:55:24,199 SUCC scripts.py:941 -- --------------------
2024-11-19 03:55:24,199 INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-11-19 03:55:24,199 INFO scripts.py:944 -- ray stop
2024-11-19 03:55:24,199 INFO scripts.py:952 -- --block
2024-11-19 03:55:24,199 INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-11-19 03:55:24,199 INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
I have done some basic searching on the message global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
but I have not found whether it might be the cause of the error, and I am not sure if it is only a warning.
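One way to rule GCS out (a sketch; the head Pod name is a placeholder) is to exec into the head Pod and check whether the worker shows up in ray status, which would mean it did register with GCS after the retries:
# Find the head Pod, then ask Ray for the cluster view from the head's side.
kubectl get pods -n kuberay -l ray.io/node-type=head
kubectl exec -n kuberay <head-pod-name> -- ray status
# If the worker node is listed as healthy, the GCS warning above was transient.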
I will be updating this with the progress I make.
Update #1 [13:27 11/19/2024]: It seems that tests of the samples are being skipped (#2475).
Update #2 [13:09 11/22/2024]: I increased the resources (CPU and memory), but it doesn't seem to show any improvement (the change is sketched below). However, when I test the application it works perfectly. One more thing I tested was using Bitnami's image, and I got a different error in the logs. As this is a warning and everything seems to work correctly, I am giving other more pressing matters priority and will get back to this.
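For reference, the resource bump from Update #2 was made on the worker group inside the RayService spec; a minimal sketch (field names follow the RayService CRD, the values below are illustrative, not the exact ones I used):
# Excerpt of the RayService manifest with larger worker resources (illustrative values).
spec:
  rayClusterConfig:
    workerGroupSpecs:
      - groupName: small-group
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  requests:
                    cpu: "2"
                    memory: 4Gi
                  limits:
                    cpu: "2"
                    memory: 4Gi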
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When following the setup tutorial, Step 3 points to an incorrect 'ray-service.sample.yaml'. When applying that file, the worker node crashes and the logs suggest that the readiness/liveness probe failed. The expected behaviour is as follows:
But in reality:
Reproduction script
Followed this tutorial and applied the linked YAML file; a reproduction sketch is below.
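A minimal reproduction sketch, assuming the KubeRay operator is installed via Helm and using the sample file named in this issue (the chart version and the exact path of the sample file may differ from what the tutorial pins):
# Install the KubeRay operator (no version pinned here; follow the tutorial's version).
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator -n kuberay --create-namespace
# Apply the RayService sample referenced in Step 3, then watch worker readiness.
kubectl apply -n kuberay -f ray-service.sample.yaml
kubectl get pods -n kuberay -w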
Anything else
No response
Are you willing to submit a PR?