Closed gariepyalex closed 1 year ago
cc. @DmitriGekhtman
Thanks for all of the details! I will take a look.
Reproduced the issue with the provided config. Looking into causes.
I've identified the issue -- there's a bug stemming from an inconsistency in how the RayCluster controller names the autoscaler's Role. The bug only occurs when the RayCluster name is long enough, which is likely with the RayCluster names generated by the RayService controller.
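To illustrate the kind of inconsistency described above, here is a minimal sketch in Python. It is not KubeRay's code: the helper names, the 50-character limit (borrowed from the "> 50 characters" remark later in this thread), the trailing-characters truncation scheme, and the example cluster names are all assumptions made for illustration. It only shows why a mismatch between two naming code paths would surface for long names and stay hidden for short ones.

```python
# Hypothetical illustration only; none of these helpers exist in KubeRay.
MAX_NAME_LEN = 50  # assumed limit, taken from the "> 50 characters" comment


def truncate_name(name: str, max_len: int = MAX_NAME_LEN) -> str:
    """Keep the trailing max_len characters (one possible truncation scheme)."""
    return name[-max_len:] if len(name) > max_len else name


def role_name_when_creating(cluster_name: str) -> str:
    # Code path A: truncates the generated name before creating the Role.
    return truncate_name(cluster_name)


def role_name_when_referencing(cluster_name: str) -> str:
    # Code path B: uses the name as-is when referencing the Role.
    return cluster_name


# Made-up RayCluster names: one short, one longer than the assumed limit.
short = "rxam-raycluster-abcde"
long = "my-very-long-rayservice-name-for-production-use-raycluster-abcde"

# The two code paths agree for the short name and disagree for the long one.
short_ok = role_name_when_creating(short) == role_name_when_referencing(short)
long_ok = role_name_when_creating(long) == role_name_when_referencing(long)
print(short_ok, long_ok)
```

Under these assumptions, the short name yields the same Role name on both paths, while the long name does not, which is consistent with the workaround of shortening the RayService name.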
I will open a PR fixing the bug.
The short-term workaround is to use a shorter name for your RayService.
I was able to deploy the RayService successfully by shortening its name to "rxam".
Not to conflate issues, but we're also seeing an issue where the head-svc endpoint names are truncated for names longer than 50 characters. Could this be related?
Truncation is necessary due to K8s length limits. Let me go back and fix the issue with char length limits -- it slipped my mind.
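For readers wondering what such truncation can look like, the following is a rough sketch, not the operator's actual logic. It assumes the 63-character DNS label limit that Kubernetes applies to Service names, and the function name `fit_name` and the "-head-svc" suffix handling are hypothetical. The base name is trimmed from the front so that the suffix always survives, and leading characters that are not lowercase letters are dropped so the result still starts with a letter.

```python
import re

K8S_LABEL_MAX = 63  # DNS label length limit that applies to Service names


def fit_name(base: str, suffix: str, max_len: int = K8S_LABEL_MAX) -> str:
    """Trim `base` from the front so that base + suffix fits in max_len.

    Hypothetical helper for illustration; not a KubeRay function.
    """
    room = max_len - len(suffix)
    if room <= 0:
        raise ValueError("suffix leaves no room for the base name")
    trimmed = base[-room:]  # keeps the whole base when it already fits
    # The label must start with a letter, so drop any other leading characters.
    trimmed = re.sub(r"^[^a-z]+", "", trimmed)
    if not trimmed:
        raise ValueError("nothing left of the base name after trimming")
    return trimmed + suffix


print(fit_name("rxam", "-head-svc"))
print(len(fit_name("a" * 70, "-head-svc")))
```

Whatever scheme is used, every code path that derives a resource name has to apply the same function, otherwise the names diverge once the limit is exceeded.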
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When deploying a RayService, the pods of the Ray cluster are not started if enableInTreeAutoscaling is true. I can see that the RayCluster and RayService resources exist in the Kubernetes cluster. Here are the logs of the operator:
Reproduction script
Note that the following RayService deploys successfully without enableInTreeAutoscaling: true
Anything else
I'm using a namespace-scoped operator and the nightly image of KubeRay.
Are you willing to submit a PR?