ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] RayServe: Old ray clusters not being deleted cause a failure in new deployments #2048

Closed: ezorita closed this issue 2 months ago

ezorita commented 3 months ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I am unsure how this happened, but it seems that the ray-operator cannot delete old RayClusters, which causes subsequent service updates to fail.

When I update the Docker image, the operator seems to trigger a new RayCluster deployment by first creating a Redis cleanup job, which fails:

$ kubectl logs -n search-backend search-backend-raycluster-gpq8m-redis-cleanup-9lcjr
[2024-03-25 11:19:46,941 I 1 1] redis_context.cc:478: Resolve Redis address to 172.20.183.170
[2024-03-25 11:19:46,941 I 1 1] redis_context.cc:364: Attempting to connect to address 172.20.183.170:6379.
[2024-03-25 11:19:46,941 I 1 1] redis_context.cc:364: Attempting to connect to address redis:6379.
[2024-03-25 11:19:46,943 I 1 1] redis_context.cc:532: Redis cluster leader is 172.20.183.170:6379
[2024-03-25 11:19:46,944 I 1 1] redis_context.cc:478: Resolve Redis address to 172.20.183.170
[2024-03-25 11:19:46,944 I 1 1] redis_context.cc:364: Attempting to connect to address 172.20.183.170:6379.
[2024-03-25 11:19:46,944 I 1 1] redis_context.cc:364: Attempting to connect to address redis:6379.
[2024-03-25 11:19:46,947 I 1 1] redis_context.cc:532: Redis cluster leader is 172.20.183.170:6379
[2024-03-25 11:19:46,947 E 1 1] _raylet.cpp:926: Failed to delete 446d6c17-4426-49aa-bf5b-980bc479eecc
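
My (possibly wrong) understanding is that the operator only removes its cleanup finalizer from the old RayCluster once this Redis cleanup Job succeeds, so a failing Job would block deletion. Something like the following should show the Job status and any finalizers on the stuck cluster (the Job name here is inferred from the pod name above):

# Inspect the failed cleanup Job and the finalizers on the stuck RayCluster.
$ kubectl describe job search-backend-raycluster-gpq8m-redis-cleanup -n search-backend
$ kubectl get raycluster search-backend-raycluster-gpq8m -n search-backend \
    -o jsonpath='{.metadata.finalizers}'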

Listing the RayClusters shows a new cluster that never manages to start:

$ kubectl get raycluster -n search-backend
NAME                              DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
search-backend-raycluster-gpq8m                                                  22h
search-backend-raycluster-nqsbl   1                 1                   ready    6d15h
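
To see why the new cluster never starts, and which cluster the RayService considers active versus pending, describing both resources should give the relevant events and status (the RayService is named search-backend, as in the operator logs below):

# Events on the stuck RayCluster and the RayService's view of active/pending clusters.
$ kubectl describe raycluster search-backend-raycluster-gpq8m -n search-backend
$ kubectl describe rayservice search-backend -n search-backend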

The ray operator logs the following in a loop (truncated for readability):

$ kubectl logs -n search-backend kuberay-operator-search-fd5cd76d7-qxg2x
(...)
2024-03-26T09:24:56.333Z    INFO    controllers.RayService  Reconciling the cluster component.  {"ServiceName": "search-backend/search-backend"}
2024-03-26T09:24:56.334Z    DEBUG   controllers.RayService  createRayClusterInstance    {"rayClusterInstanceName": "search-backend-raycluster-gpq8m"}
2024-03-26T09:24:56.334Z    DEBUG   controllers.RayService  Ray cluster already exists, config changes. Need to recreate. Delete the pending one now.   {"key": "search-backend/search-backend-raycluster-gpq8m", "rayClusterInstance.Spec": {"headGroupSpec":{"rayStartParams":{"dashboard-host":"0.0.0.0","num-cpus":"0"},"template":{"metadata":{"creationTimestamp":nu (...TRUNCATED)
2024-03-26T09:24:56.364Z    INFO    controllers.RayService  Done reconcileRayCluster update status, enter next loop to create new ray cluster.  {"ServiceName": "search-backend/search-backend"}

Even if I delete the whole RayService in Kubernetes, delete the Redis deployment, and manually delete the kuberay-operator, these RayClusters are never deleted. After recreating the resources, the clusters start to pile up:

$ kubectl get raycluster
NAME                              DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
search-backend-raycluster-gpq8m                                                  2d6h
search-backend-raycluster-n292x   1                 1                   ready    15m
search-backend-raycluster-nqsbl   1                 1                   ready    7d23h
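
If the old clusters are stuck mid-deletion, they should have a deletionTimestamp set and a non-empty finalizer list; something like this should confirm it (a sketch using the cluster name from the listing above):

# A set deletionTimestamp plus remaining finalizers means deletion is blocked.
$ kubectl get raycluster search-backend-raycluster-gpq8m -n search-backend \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'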

but only the newest cluster seems to have a head pod:

$ kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
search-backend-raycluster-n292x-worker-cpu-process-zts5d   1/1     Running   0          21m
kuberay-operator-search-fd5cd76d7-fl7pr                   1/1     Running   0          23m
redis-76c84bbfbc-qrm2x                                    1/1     Running   0          23m
search-backend-raycluster-n292x-head-hjzpp                2/2     Running   0          22m
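
As a last-resort workaround (assuming a finalizer is what blocks deletion), clearing the finalizers lets Kubernetes finish deleting the stuck cluster. Note that this skips the Redis cleanup Job entirely, so any GCS metadata left in Redis would have to be removed by hand:

# Trigger deletion (a no-op if the cluster is already terminating), then clear the finalizers.
$ kubectl delete raycluster search-backend-raycluster-gpq8m -n search-backend --wait=false
$ kubectl patch raycluster search-backend-raycluster-gpq8m -n search-backend \
    --type=merge -p '{"metadata":{"finalizers":null}}'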

Reproduction script

I couldn't find a reproduction script for this behavior, and I am unsure how the system ended up in this state.

Anything else

No response

Are you willing to submit a PR?

ezorita commented 2 months ago

After deleting and recreating the cluster, I haven't been able to reproduce this anymore.