ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] [API Server] JobSubmission service does not work for cluster names >41 characters #2169

Open smit-kiri opened 1 month ago

smit-kiri commented 1 month ago

KubeRay Component

apiserver

What happened + What you expected to happen

The JobSubmission service appends -head-svc to the cluster name to get the Kubernetes service name. However, KubeRay trims the head service name to 50 characters. Since -head-svc is 9 characters long, any RayCluster name longer than 41 characters produces a service name that is truncated from the beginning to fit the 50-character limit.

For example, if the RayCluster name is 82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-raycluster-pfsxh (53 characters), the head service name becomes r-4ef9-8828-5133dfe1a1fa-raycluster-pfsxh-head-svc (50 characters).

However, when that cluster name is passed to the JobSubmission service, the service looks up the untrimmed name and fails with the following error:

dial tcp: lookup 82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-raycluster-pfsxh-head-svc.default.svc.cluster.local on 10.96.0.10:53: no such host

Reproduction script

Create a RayCluster with a name longer than 41 characters, then submit a job to it through the JobSubmission service.

Anything else

No response

smit-kiri commented 1 month ago

Ideally, the JobSubmission service would derive the service name with the same logic the operator uses: https://github.com/ray-project/kuberay/blob/aeb8b03dfedcebf5105352a552b164afc5bfdbfb/ray-operator/controllers/ray/utils/util.go#L109-L131