ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feature] Should we also set PublishNotReadyAddresses if the service is not headless? #2157

Open rueian opened 1 month ago

rueian commented 1 month ago

Description

In the PR https://github.com/ray-project/kuberay/pull/2117, we found that a Ray worker may experience a serious delay in connecting to the Head through its headless service when the default PublishNotReadyAddresses=false is used. So that PR sets PublishNotReadyAddresses=true for headless services.
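For reference, with that change the generated headless service looks roughly like this (a sketch; the name, selector, and port are illustrative, but `publishNotReadyAddresses` is the actual Kubernetes `ServiceSpec` field):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: raycluster-sample-headless     # illustrative name
spec:
  clusterIP: None                      # headless: DNS returns Pod IPs directly
  publishNotReadyAddresses: true       # publish DNS records even before Pods are ready
  selector:
    ray.io/cluster: raycluster-sample  # illustrative selector
  ports:
    - name: gcs
      port: 6379                       # default Ray GCS port
```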

This issue is meant to track the following questions:

  1. Determine the source of the delay. Why can a RayCluster with one worker node and a default headless service take about 100 seconds to become fully ready in a fresh kind cluster?
  2. Should we also set PublishNotReadyAddresses=true for other service types?

Use case

https://github.com/ray-project/kuberay/pull/2117/files#r1603897666

Related issues

https://github.com/ray-project/kuberay/pull/2117

rueian commented 1 month ago

Regarding the source of the delay when PublishNotReadyAddresses=false, I have recorded a demo of creating a RayCluster with one worker in a fresh Kind environment:

Full Demo Recording:

https://github.com/ray-project/kuberay/assets/2727535/29c5e8a3-80de-463c-8b40-9ca2a79ddaa5

Timeline Breakdown:

  1. Kind took about 36 seconds to pull the Ray image.
  2. The worker took about 53 seconds to pass the ray health-check.
  3. The worker took another 10 seconds to pass the k8s readiness check.

Result Screenshot:


As a result, the RayCluster took about 100 seconds to become fully ready.

Root Cause Analysis

From the timeline breakdown above, we can see that most of the time was spent on multiple retries of ray health-check.

To inspect what is happening underneath, we can set GRPC_VERBOSITY=DEBUG before running ray health-check and preserve its output:

diff --git a/ray-operator/controllers/ray/common/pod.go b/ray-operator/controllers/ray/common/pod.go
index 9939e63..1d747b7 100644
--- a/ray-operator/controllers/ray/common/pod.go
+++ b/ray-operator/controllers/ray/common/pod.go
@@ -172,16 +172,19 @@ func DefaultWorkerPodTemplate(ctx context.Context, instance rayv1.RayCluster, wo
            Args: []string{
                fmt.Sprintf(`
                    SECONDS=0
                    while true; do
                        if (( SECONDS <= 120 )); then
-                           if ray health-check --address %s:%s > /dev/null 2>&1; then
+                           export GRPC_VERBOSITY=DEBUG
+                           if ray health-check --address %s:%s; then
                                break
                            fi
                            echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."

Then we apply the RayCluster again to a fresh Kind environment:

Full Debug Recording:

https://github.com/ray-project/kuberay/assets/2727535/20a78408-3a2d-43dd-902f-52a26d4ace59

Debug Screenshot:


This clearly confirms that ray health-check can fail multiple times with a "Domain name not found" error, which is indeed caused by PublishNotReadyAddresses=false.

rueian commented 1 month ago

Regarding the second question: Should we also set PublishNotReadyAddresses=true for other service types?

My previous thought was that other service types would not raise the "Domain name not found" error because they have a virtual IP, so there was no need to set PublishNotReadyAddresses=true.

However, given that

  1. KubeRay guarantees that a Head service selects only one Pod at any given time, so there is no risk of reaching the wrong Head.
  2. Allowing users to access the Ray dashboard through the service for troubleshooting when readiness fails is an important use case.

I think we can safely set PublishNotReadyAddresses=true for all Head service types. WDYT @kevin85421? I'd value your feedback on this.
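Concretely, that would mean setting the same field on the regular (non-headless) head service as well; a sketch (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: raycluster-sample-head-svc   # illustrative name
spec:
  type: ClusterIP                    # the same field applies to NodePort / LoadBalancer
  publishNotReadyAddresses: true     # keep endpoints published while the head is unready
  selector:
    ray.io/node-type: head           # illustrative selector
  ports:
    - name: dashboard
      port: 8265                     # default Ray dashboard port
```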

kevin85421 commented 1 month ago

The worker took about 53 seconds to pass the ray health-check.

Interesting. I am very surprised that DNS registration takes much longer than I expected, even though the head Pod is already ready. Thank you for the investigation!

I think we can safely set PublishNotReadyAddresses=true for all Head service types.

Makes sense to me.