Open rueian opened 1 month ago
Regarding the source of the delay when PublishNotReadyAddresses=false
, I have recorded a demo of creating a RayCluster with one worker in a fresh Kind environment:
https://github.com/ray-project/kuberay/assets/2727535/29c5e8a3-80de-463c-8b40-9ca2a79ddaa5
ray health-check
.The result is the RayCluster took about 100 seconds to be fully ready.
From the above timeline breakdown, we can see that most of the time was spent on multiple retires of ray health-check
.
To inspect what is happening underneath, we can add the GRPC_VERBOSITY=DEBUG
before doing the ray health-check
and preserve its output:
diff --git a/ray-operator/controllers/ray/common/pod.go b/ray-operator/controllers/ray/common/pod.go
index 9939e63..1d747b7 100644
--- a/ray-operator/controllers/ray/common/pod.go
+++ b/ray-operator/controllers/ray/common/pod.go
@@ -172,16 +172,19 @@ func DefaultWorkerPodTemplate(ctx context.Context, instance rayv1.RayCluster, wo
Args: []string{
fmt.Sprintf(`
SECONDS=0
while true; do
if (( SECONDS <= 120 )); then
+ export GRPC_VERBOSITY=DEBUG
- if ray health-check --address %s:%s > /dev/null 2>&1; then
+ if ray health-check --address %s:%s; then
break
fi
echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
Then we apply the RayCluster again to a fresh Kind environment:
https://github.com/ray-project/kuberay/assets/2727535/20a78408-3a2d-43dd-902f-52a26d4ace59
We now clearly confirm that the ray health-check
can fail multiple times with the Domain name not found
error which is indeed caused by PublishNotReadyAddresses=false
.
Regarding the second question: Should we also set PublishNotReadyAddresses=true for other service types?
My previous thought was that other service types would not raise the Domain name not found error
because they had a virtual IP, so there was no need to PublishNotReadyAddresses=true.
However, given that
I think we can safely set PublishNotReadyAddresses=true for all Head service types. WDYT @kevin85421? I'd value your feedback on this.
The worker took about 53 seconds to pass the ray health-check.
Interesting. I am very surprised that DNS registration takes much longer than my expectation, although the head Pod is already ready. Thank you for the investigation!
I think we can safely set PublishNotReadyAddresses=true for all Head service types.
Make sense to me.
Search before asking
Description
In the PR https://github.com/ray-project/kuberay/pull/2117, we found a Ray worker may have a serious delay in connecting to the Head through its headless service with the default
PublishNotReadyAddresses=false
. So we setPublishNotReadyAddresses=true
for headless services in the PR.This issue is meant to track the following questions:
PublishNotReadyAddresses=true
for other service types?Use case
https://github.com/ray-project/kuberay/pull/2117/files#r1603897666
Related issues
https://github.com/ray-project/kuberay/pull/2117
Are you willing to submit a PR?