Search before asking
[X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I am unsure how this happened, but it seems that the ray-operator cannot delete old RayClusters, which causes subsequent service updates to fail.
When updating the Docker image, the operator seems to trigger a new RayCluster deployment by creating a Redis cleanup job, which fails:
$ kubectl logs -n search-backend search-backend-raycluster-gpq8m-redis-cleanup-9lcjr
[2024-03-25 11:19:46,941 I 1 1] redis_context.cc:478: Resolve Redis address to 172.20.183.170
[2024-03-25 11:19:46,941 I 1 1] redis_context.cc:364: Attempting to connect to address 172.20.183.170:6379.
[2024-03-25 11:19:46,941 I 1 1] redis_context.cc:364: Attempting to connect to address redis:6379.
[2024-03-25 11:19:46,943 I 1 1] redis_context.cc:532: Redis cluster leader is 172.20.183.170:6379
[2024-03-25 11:19:46,944 I 1 1] redis_context.cc:478: Resolve Redis address to 172.20.183.170
[2024-03-25 11:19:46,944 I 1 1] redis_context.cc:364: Attempting to connect to address 172.20.183.170:6379.
[2024-03-25 11:19:46,944 I 1 1] redis_context.cc:364: Attempting to connect to address redis:6379.
[2024-03-25 11:19:46,947 I 1 1] redis_context.cc:532: Redis cluster leader is 172.20.183.170:6379
[2024-03-25 11:19:46,947 E 1 1] _raylet.cpp:926: Failed to delete 446d6c17-4426-49aa-bf5b-980bc479eecc
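As a sanity check, the UUID from the last log line can be looked up in Redis directly. This is only a diagnostic sketch and assumes the Redis Deployment is named redis and runs in the same namespace:
$ # Look for any Redis keys containing the UUID the cleanup job failed to delete
$ kubectl exec -n search-backend deploy/redis -- redis-cli --scan --pattern '*446d6c17-4426-49aa-bf5b-980bc479eecc*'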
Listing the RayClusters, I see there is a new cluster that never manages to start:
$ kubectl get raycluster -n search-backend
NAME                              DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
search-backend-raycluster-gpq8m                                                  22h
search-backend-raycluster-nqsbl   1                 1                   ready    6d15h
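My suspicion is that the failing Redis cleanup job leaves a finalizer on the stuck cluster that prevents the operator from ever finishing the recreation. A rough way to check (just a diagnostic sketch; I am not sure which finalizer KubeRay sets here):
$ # Show any finalizers and recent events on the stuck cluster
$ kubectl get raycluster search-backend-raycluster-gpq8m -n search-backend -o jsonpath='{.metadata.finalizers}'
$ kubectl describe raycluster search-backend-raycluster-gpq8m -n search-backend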
The ray operator logs the following in a loop (truncated for readability):
$ kubectl logs -n search-backend kuberay-operator-search-fd5cd76d7-qxg2x
(...)
2024-03-26T09:24:56.333Z INFO controllers.RayService Reconciling the cluster component. {"ServiceName": "search-backend/search-backend"}
2024-03-26T09:24:56.334Z DEBUG controllers.RayService createRayClusterInstance {"rayClusterInstanceName": "search-backend-raycluster-gpq8m"}
2024-03-26T09:24:56.334Z DEBUG controllers.RayService Ray cluster already exists, config changes. Need to recreate. Delete the pending one now. {"key": "search-backend/search-backend-raycluster-gpq8m", "rayClusterInstance.Spec": {"headGroupSpec":{"rayStartParams":{"dashboard-host":"0.0.0.0","num-cpus":"0"},"template":{"metadata":{"creationTimestamp":nu (...TRUNCATED)
2024-03-26T09:24:56.364Z INFO controllers.RayService Done reconcileRayCluster update status, enter next loop to create new ray cluster. {"ServiceName": "search-backend/search-backend"}
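To see which cluster the RayService considers active versus pending, something like the following should help. The jsonpath field names are my assumption about the RayService status layout, so take them with a grain of salt:
$ kubectl describe rayservice search-backend -n search-backend
$ # Assuming the status exposes activeServiceStatus/pendingServiceStatus fields
$ kubectl get rayservice search-backend -n search-backend -o jsonpath='{.status.activeServiceStatus.rayClusterName} {.status.pendingServiceStatus.rayClusterName}'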
Even if I delete the whole RayService in Kubernetes, delete the Redis deployment, and manually delete the kuberay-operator, these RayClusters are never deleted. After recreating the resources, the clusters start to pile up:
$ kubectl get raycluster
NAME                              DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
search-backend-raycluster-gpq8m                                                  2d6h
search-backend-raycluster-n292x   1                 1                   ready    15m
search-backend-raycluster-nqsbl   1                 1                   ready    7d23h
However, only the newest cluster seems to have a head pod:
$ kubectl get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
search-backend-raycluster-n292x-worker-cpu-process-zts5d    1/1     Running   0          21m
kuberay-operator-search-fd5cd76d7-fl7pr                     1/1     Running   0          23m
redis-76c84bbfbc-qrm2x                                      1/1     Running   0          23m
search-backend-raycluster-n292x-head-hjzpp                  2/2     Running   0          22m
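If the cleanup finalizer really is what blocks deletion, a rough (untested) workaround sketch would be to clear the finalizers on the stuck cluster and delete it manually. This is obviously not a real fix, just a way to unblock the service update:
$ # Untested last-resort workaround: clear finalizers so Kubernetes can garbage-collect the object
$ kubectl patch raycluster search-backend-raycluster-gpq8m -n search-backend --type merge -p '{"metadata":{"finalizers":null}}'
$ kubectl delete raycluster search-backend-raycluster-gpq8m -n search-backend --ignore-not-found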
Reproduction script
I wasn't able to come up with a reproduction script for this behavior, and I am unsure how the system ended up in this state.
Anything else
No response
Are you willing to submit a PR?