ray-project / kuberay

[Bug] RayService Restarts After Upgrading KubeRay to v1.2.0-rc.0 #2315

Closed ryanaoleary closed 2 months ago

ryanaoleary commented 2 months ago

KubeRay Component

ray-operator

What happened + What you expected to happen

Upgrading the kuberay-operator from v1.1.1 to the v1.2.0-rc.0 release candidate causes a RayService to restart, reconciling the RayCluster and creating a new set of head and worker Pods. IIUC, the expected behavior when installing a newer version of the KubeRay operator is that running workloads are unaffected; instead, the RayService restarts and the deployment is interrupted.

Relevant logs from the kuberay-operator after upgrading the image to v1.2.0-rc.0:

space":"default"},"reconcileID":"fc484444-80db-44d8-8652-38e4dff60878","activeClusterNumWorkerGroups":1,"goalNumWorkerGroups":1}
{"level":"info","ts":"2024-08-20T03:10:38.257Z","logger":"controllers.RayService","msg":"Active RayCluster config doesn't match goal config. RayService operator should prepare a new Ray cluster.\n* Active RayCluster config hash: D22UJ34PG57MTG9IDLDI0AAVJPLLE70D\n* Goal RayCluster config hash: OKN3DC1IOSSBST33DDOLIJR03TRSCB7A","RayService":{"name":"stable-diffusion-tpu","namespace":"default"},"reconcileID":"fc484444-80db-44d8-8652-38e4dff60878"}
{"level":"info","ts":"2024-08-20T03:10:38.257Z","logger":"controllers.RayService","msg":"Current cluster is unhealthy, prepare to restart.","RayService":{"name":"stable-diffusion-tpu","namespace":"default"},"reconcileID":"fc484444-80db-44d8-8652-38e4dff60878","Status":{"activeServiceStatus":{"applicationStatuses":{"stable_diffusion":{"status":"RUNNING","healthLastUpdateTime":"2024-08-20T03:10:08Z","serveDeploymentStatuses":{"APIIngress":{"status":"HEALTHY","healthLastUpdateTime":"2024-08-20T03:10:08Z"},"StableDiffusion":{"status":"HEALTHY","healthLastUpdateTime":"2024-08-20T03:10:08Z"}}}},"rayClusterName":"stable-diffusion-tpu-raycluster-lwbv5","rayClusterStatus":{"state":"ready","availableWorkerReplicas":1,"desiredWorkerReplicas":1,"minWorkerReplicas":1,"maxWorkerReplicas":10,"desiredCPU":"102","desiredMemory":"208G","desiredGPU":"0","desiredTPU":"4","lastUpdateTime":"2024-08-20T02:50:45Z","endpoints":{"client":"10001","dashboard":"8265","gcs":"6379","metrics":"8080","serve":"8000"},"head":{"podIP":"10.238.129.67","serviceIP":"10.238.129.67"},"observedGeneration":1}},"pendingServiceStatus":{"rayClusterStatus":{"desiredCPU":"0","desiredMemory":"0","desiredGPU":"0","desiredTPU":"0","head":{}}},"serviceStatus":"FailedToUpdateService","numServeEndpoints":2,"observedGeneration":1,"lastUpdateTime":"2024-08-20T02:35:15Z"}}
...
{"level":"error","ts":"2024-08-20T03:10:38.359Z","logger":"controllers.RayService","msg":"Fail to reconcileServe.","RayService":{"name":"stable-diffusion-tpu","namespace":"default"},"reconcileID":"9141c8e2-0804-49f5-b1d1-ff5532cf6539","error":"Found 0 head pods for RayCluster stable-diffusion-tpu-raycluster-7rxg6 in the namespace default","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:163\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-08-20T03:10:38.360Z","logger":"controllers.RayService","msg":"Reconciling the Serve component. Active and pending Ray clusters exist.","RayService":{"name":"stable-diffusion-tpu","namespace":"default"},"reconcileID":"2874b819-f616-4ab7-a288-d4667f35b6cf"}
{"level":"info","ts":"2024-08-20T03:10:38.360Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"stable-diffusion-tpu","namespace":"default"},"reconcileID":"2874b819-f616-4ab7-a288-d4667f35b6cf","head service name":"stable-diffusion-tpu-raycluster-lwbv5-head-svc","namespace":"default"}
{"level":"info","ts":"2024-08-20T03:10:38.360Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"stable-diffusion-tpu","namespace":"default"},"reconcileID":"2874b819-f616-4ab7-a288-d4667f35b6cf","head service URL":"stable-diffusion-tpu-raycluster-lwbv5-head-svc.default.svc.cluster.local:8265","port":"dashboard"}

Reproduction script

Steps to reproduce:

  1. Install a previous version of KubeRay

    helm repo add kuberay https://ray-project.github.io/kuberay-helm/
    helm repo update
    helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
  2. Create a RayService

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/ai-ml/gke-ray/rayserve/stable-diffusion/ray-service.yaml
  3. Verify the RayService deploys successfully

    kubectl get rayservices

    Verify that the RayService is running.

  4. Upgrade to a newer version of the KubeRay operator

    kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.2.0-rc.0"
    helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0

  5. Observe that the RayService restarts and creates a new Ray head Pod and worker Pods

    kubectl get rayservices

    The RayService status is restarting and the RayCluster is recreated.

### Anything else

This occurs every time the ray-operator image is upgraded to `v1.2.0-rc.0`, but not when installing a KubeRay operator at the same version as, or a lower version than, the currently installed one.

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!
andrewsykim commented 2 months ago

I think there's one step missing in your upgrade per https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/upgrade-guide.html#upgrade-kuberay:

kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.2.0-rc.0"

Is the issue still reproducible if you ran this before the helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0?

ryanaoleary commented 2 months ago

> I think there's one step missing in your upgrade per https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/upgrade-guide.html#upgrade-kuberay:
>
> kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.2.0-rc.0"
>
> Is the issue still reproducible if you ran this before the helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0?

I just tried it with:

kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.2.0-rc.0"

helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0

to upgrade KubeRay and saw the same issue. It looks like replacing the CRDs first does not fix the problem. (Screenshot attached: rayservice_reconciling.)

andrewsykim commented 2 months ago

I wonder if the hash mismatch is due to https://github.com/ray-project/kuberay/pull/2144 where we re-organized fields in RayCluster to adhere to golangci-lint.
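
For illustration (a minimal standalone Go snippet, not KubeRay's actual hashing code), this shows why re-ordering struct fields changes the hash: encoding/json emits struct fields in declaration order, so the same data marshals to different bytes and therefore a different SHA-1 digest.

```go
package main

import (
	"crypto/sha1"
	"encoding/json"
	"fmt"
)

// Two logically identical specs whose fields are declared in a different order,
// mimicking the field re-organization in the RayCluster types.
type specV1 struct {
	Replicas  int    `json:"replicas"`
	GroupName string `json:"groupName"`
}

type specV2 struct {
	GroupName string `json:"groupName"`
	Replicas  int    `json:"replicas"`
}

func hash(v interface{}) string {
	b, _ := json.Marshal(v) // fields are emitted in struct declaration order
	return fmt.Sprintf("%x", sha1.Sum(b))
}

func main() {
	fmt.Println(hash(specV1{Replicas: 1, GroupName: "workers"}))
	fmt.Println(hash(specV2{Replicas: 1, GroupName: "workers"})) // different digest for the same data
}
```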

ryanaoleary commented 2 months ago

I think I've verified the hash mismatch is due to the golangci-lint changes - reverting the RayCluster Spec to the same fields/order as v1.1.1 and re-deploying the operator with that image and CRDs does not cause the RayService to restart. As suggested offline, I think a potential fix is to check for a v1.2.0 annotation that's set when the operator is upgraded, and generate a new hash for activeRayCluster.ObjectMeta.Annotations[utils.HashWithoutReplicasAndWorkersToDeleteKey] when found. This will fix the issue in shouldPrepareNewRayCluster where a new RayCluster is being created due to the v1.2.0 generated hash differing, even though no changes have been made to the RayCluster. cc: @kevin85421

kevin85421 commented 2 months ago

Thanks, @ryanaoleary! Great find!

As we discussed this morning, there are two solutions.

  • Solution 1: Sort and then generate the hash for the utils.HashWithoutReplicasAndWorkersToDeleteKey annotation. My only concern is whether this method will still work when we add a new field to the CRD.
  • Solution 2: Add a new annotation or label ray.io/kuberay-version. If the value of ray.io/kuberay-version doesn't exist or is different from the version of the KubeRay operator pod, we update the value of utils.HashWithoutReplicasAndWorkersToDeleteKey first and then add or update the value of ray.io/kuberay-version before doing zero-downtime upgrade.

@andrewsykim @ryanaoleary, would you mind sharing your thoughts on these two solutions? Thanks!

kevin85421 commented 2 months ago

Also cc @MortalHappiness because this issue is related to golangci-lint changes https://github.com/ray-project/kuberay/issues/2315#issuecomment-2299842715

ryanaoleary commented 2 months ago

> Thanks, @ryanaoleary! Great find!
>
> As we discussed this morning, there are two solutions.
>
>   • Solution 1: Sort and then generate the hash for the utils.HashWithoutReplicasAndWorkersToDeleteKey annotation. My only concern is whether this method will still work when we add a new field to the CRD.
>   • Solution 2: Add a new annotation or label ray.io/kuberay-version. If the value of ray.io/kuberay-version doesn't exist or is different from the version of the KubeRay operator pod, we update the value of utils.HashWithoutReplicasAndWorkersToDeleteKey first and then add or update the value of ray.io/kuberay-version before doing zero-downtime upgrade.
>
> @andrewsykim @ryanaoleary, would you mind sharing your thoughts on these two solutions? Thanks!

I think solution 2 makes the most sense to me. For solution 1, we'd ensure that newly generated hashes are consistent regardless of the order of fields in the RayCluster spec, but for existing RayClusters the value of utils.HashWithoutReplicasAndWorkersToDeleteKey would not match and the RayService would still have to restart when upgrading KubeRay versions.
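
For concreteness, here is a rough Go sketch of the idea behind solution 2. Only the ray.io/kuberay-version annotation and the utils.HashWithoutReplicasAndWorkersToDeleteKey name come from this thread; the helper names and the placeholder hash key below are illustrative assumptions, not KubeRay's actual implementation.

```go
package main

import "fmt"

const (
	// Annotation proposed in solution 2.
	kubeRayVersionAnnotation = "ray.io/kuberay-version"
	// Stand-in for the annotation key referenced by
	// utils.HashWithoutReplicasAndWorkersToDeleteKey; the real string differs.
	specHashAnnotation = "example.io/hash-without-replicas-and-workers-to-delete"
)

// ensureHashForCurrentVersion sketches the idea: if the cluster's annotations were
// written by a different (or unknown) KubeRay version, recompute the spec hash with
// the current operator's field layout before shouldPrepareNewRayCluster compares
// hashes, then stamp the operator version so later reconciles skip this step.
func ensureHashForCurrentVersion(annotations map[string]string, operatorVersion string, computeHash func() string) {
	if annotations[kubeRayVersionAnnotation] == operatorVersion {
		return // hash was already generated by this operator version
	}
	annotations[specHashAnnotation] = computeHash()
	annotations[kubeRayVersionAnnotation] = operatorVersion
}

func main() {
	ann := map[string]string{specHashAnnotation: "HASH-FROM-V1.1.1"}
	ensureHashForCurrentVersion(ann, "v1.2.0", func() string { return "HASH-FROM-V1.2.0" })
	fmt.Println(ann) // hash refreshed and version stamped, so no restart is triggered
}
```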

kevin85421 commented 2 months ago

> I think solution 2 makes the most sense to me. For solution 1, we'd ensure that newly generated hashes are consistent regardless of the order of fields in the RayCluster spec, but for existing RayClusters the value of utils.HashWithoutReplicasAndWorkersToDeleteKey would not match and the RayService would still have to restart when upgrading KubeRay versions.

SGTM

MortalHappiness commented 2 months ago

@kevin85421

Solution 1 will not work if we add a new field to the CRD, regardless of whether we sort the fields. Under the hood it first json.Marshals the object and then hashes it with sha1, so the additional field will definitely result in a different hash whether the fields are sorted or not. Furthermore, if we add a new field to the CRD, the version of the CRD should be changed, right? We can handle different versions of the CR separately, so this may not be an issue.

Solution 2 makes a lot of sense and sounds good to me. It gives us flexibility when upgrading the KubeRay operator because in the future we can update other fields or annotations during the upgrade too. My only concern is that we need to update annotations for all existing CRs, and I don't know whether that will cause performance issues. But it is a must for v1.2.0. For future versions, if there are performance issues, maybe we can update ray.io/kuberay-version only if it is inconsistent with the current version of the KubeRay operator and some annotations or fields are not compatible with the current version of the operator.

kevin85421 commented 2 months ago

> My only concern is that we need to update annotations for all existing CRs, and I don't know whether that will cause performance issues.

I think it is fine.

andrewsykim commented 2 months ago

> Solution 1 will not work if we add a new field to the CRD, regardless of whether we sort the fields. Under the hood it first json.Marshals the object and then hashes it with sha1, so the additional field will definitely result in a different hash whether the fields are sorted or not. Furthermore, if we add a new field to the CRD, the version of the CRD should be changed, right? We can handle different versions of the CR separately, so this may not be an issue.

Worth noting that if additional fields are added as pointers with a default value of nil, they will not change the hash on upgrade.
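
A small self-contained illustration of this point, assuming the new field is optional and carries the usual omitempty JSON tag (the common convention for optional CRD fields): a nil pointer is then omitted from the marshaled output, so the bytes and the resulting hash are unchanged.

```go
package main

import (
	"crypto/sha1"
	"encoding/json"
	"fmt"
)

// Old spec, and a new spec that adds an optional pointer field with omitempty.
// The field name here is just an example.
type oldSpec struct {
	GroupName string `json:"groupName"`
}

type newSpec struct {
	GroupName string `json:"groupName"`
	Suspend   *bool  `json:"suspend,omitempty"` // new optional field, nil by default
}

func hash(v interface{}) string {
	b, _ := json.Marshal(v)
	return fmt.Sprintf("%x", sha1.Sum(b))
}

func main() {
	// The nil pointer is omitted from the JSON, so both specs marshal to the same
	// bytes and produce the same hash; dropping omitempty (or using a non-pointer
	// zero value) would change the bytes and therefore the hash.
	fmt.Println(hash(oldSpec{GroupName: "workers"}) == hash(newSpec{GroupName: "workers"})) // true
}
```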