ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feat][RayCluster] Make the Head service headless #2117

Closed rueian closed 1 month ago

rueian commented 1 month ago

Why are these changes needed?

The current standard k8s Service for the Ray head works very well, except for a rare hairpin issue and the occasional need to reduce IP usage. However, the purpose of a standard k8s Service is to hide multiple endpoints behind a shared virtual address, which does not make much sense in the context of Ray: a Ray cluster can have only one head node at any given time, and its worker nodes cannot switch to a different head. A headless Service, which enables direct access without a virtual IP, is therefore a better fit for Ray.

This PR changes the default Service for the Ray head from a standard k8s Service to a headless Service. A headless Service still provides A/AAAA DNS records under the same Service name, so this change should not affect network access through the Service name. It also does not affect currently running RayClusters, because existing Services are not recreated. Users who want the standard Service must specify Spec.HeadGroupSpec.HeadService explicitly.
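
For illustration, the only schema-level difference is `spec.clusterIP: None`. Here is a minimal sketch of what a headless head Service looks like; the names, labels, and port list are illustrative, and the manifest the operator actually generates may differ:

```yaml
# Minimal sketch of a headless head Service. Names, labels, and ports
# are illustrative; the operator-generated manifest may differ.
apiVersion: v1
kind: Service
metadata:
  name: raycluster-sample-head-svc
spec:
  clusterIP: None              # headless: no virtual IP is allocated
  selector:
    ray.io/node-type: head     # selects the single head pod
  ports:
    - name: gcs
      port: 6379
    - name: dashboard
      port: 8265
    - name: client
      port: 10001
```

Since cluster DNS resolves a headless Service name straight to the head pod's A/AAAA records, clients using raycluster-sample-head-svc.<namespace>.svc.cluster.local are unaffected; only clients that hardcoded the old ClusterIP would notice the change.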

Related issue number

#948

Checks

rueian commented 1 month ago

Hi @kevin85421, please let me know what else I haven't thought of that could be broken by this change.

rueian commented 1 month ago

Indeed, in most cases the failure is related to ray-cluster.external-redis-uri.yaml, but I suspect that is because it is always the first test that depends on an external redis image. Changing the test order might produce a different result.

I have tested ray-cluster.external-redis-uri.yaml locally many times, and I can't find any problems. Here are some thoughts:

From those failure logs, we can see that the kind cluster took an extra 2 seconds to download the external redis image, while the ray image was already on the machine from previous tests.

[screenshot of CI logs]

[screenshot of CI logs]

So we know the ray head started before redis, and the "Failed to connect to Redis" retry messages are expected.

[screenshot of head pod logs]

However, the number of these retry messages is unexpectedly low.

By using --tail=-1 to get all of the head logs, we can see that the head only retried 4 or 5 times. Given that each retry takes 500ms, the actual execution time of the Ray head within the 90-second window could have been only about 3 seconds. That is weird, but I am afraid we may have no choice but to extend the timeout.

Kai-Hsun Chen @.***> wrote on Tuesday, May 7, 2024 at 12:37 AM:

@.**** commented on this pull request.

In tests/test_sample_raycluster_yamls.py https://github.com/ray-project/kuberay/pull/2117#discussion_r1591276111:

```diff
@@ -78,7 +78,7 @@ def parse_args():
             logger.info('[SKIP TEST %d] %s: %s', index, new_cr['name'], skip_tests[new_cr['name']])
             continue
         logger.info('[TEST %d]: %s', index, new_cr['name'])
-        addEvent = RayClusterAddCREvent(new_cr['cr'], [rs], 90, NAMESPACE, new_cr['path'])
+        addEvent = RayClusterAddCREvent(new_cr['cr'], [rs], 180, NAMESPACE, new_cr['path'])
```

We should not increase the timeout here if we don't know why the test can't finish in 90 seconds. This may obscure the actual root cause. There might be something happening under the hood, because in most cases, the failure is related to ray-cluster.external-redis-uri.yaml rather than occurring randomly across multiple tests. We don’t need to investigate the actual root cause in this PR, but we should still keep the timeout at 90 seconds until we understand what happened.


kevin85421 commented 1 month ago

@rueian thank you for your investigation into the flakiness! Let's chat in our meeting this week.

kevin85421 commented 1 month ago

cc @davidxia

davidxia commented 1 month ago

Couldn’t be better timing. My team needs this because we are running out of Service IPs.

kevin85421 commented 1 month ago

@davidxia, I want to make the headless service the default behavior because we will always have at most one Ray head pod. However, it represents a breaking change for users who use ClusterIP instead of the Kubernetes Service domain name. I believe most users do not use ClusterIP directly. Would you mind sharing your insights on this potential breaking change based on your hands-on production experience?

davidxia commented 1 month ago

Would you mind sharing your insights on this potential breaking change based on your hands-on production experience?

Most of our users use the head Pod IP to submit Ray jobs and for the Ray client. A smaller portion of users use a custom internal domain like my-ray-cluster.company.com which goes through a Google L7 internal load balancer -> Istio Gateway -> head Pod.

Istio's Gateway relies on an Istio VirtualService, which in turn relies on the head Pod's K8s Service to route HTTP requests to the Pod. We tested (outside of kuberay) that this Istio routing behavior continues to work with a headless Service.
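
For reference, a minimal sketch of that routing shape; the hostname, gateway, and Service names below are hypothetical placeholders, not our actual config:

```yaml
# Hypothetical sketch: route an internal domain through an Istio Gateway
# to the head pod via its Kubernetes Service (now headless by default).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-ray-cluster
spec:
  hosts:
    - my-ray-cluster.company.com   # custom internal domain (placeholder)
  gateways:
    - internal-istio-gateway       # assumed Istio Gateway name
  http:
    - route:
        - destination:
            host: raycluster-sample-head-svc   # head pod's K8s Service
            port:
              number: 8265                     # Ray dashboard / job submission
```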

Does this answer your question, @kevin85421?

kevin85421 commented 1 month ago

Thanks @davidxia!

kevin85421 commented 1 month ago

Synced with @rueian offline. We plan to make the headless service the default behavior, but we will still provide an option (e.g., an env var or annotations) to let users revert to the old behavior. cc @andrewsykim does this make sense to you? Thanks!

andrewsykim commented 1 month ago

cc @andrewsykim does this make sense to you?

Sounds good, thanks!

rueian commented 1 month ago

Hi @kevin85421 and @andrewsykim, I have added the env var ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE to toggle the old behavior, as sketched below.
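
For anyone who wants to keep the old behavior, here is a hedged sketch of setting the toggle on the operator Deployment; the container name follows the default Helm chart and is an assumption, and only the env entry is the actual switch:

```yaml
# Hypothetical Deployment fragment for the KubeRay operator: setting this
# env var makes newly created RayClusters get a ClusterIP head Service again.
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator
          env:
            - name: ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE
              value: "true"
```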

I guess this may also need to be documented somewhere. Where can I add a note about this?