I think there's one step missing in your upgrade per https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/upgrade-guide.html#upgrade-kuberay:
`kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.2.0-rc.0"`
Is the issue still reproducible if you run this before `helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0`?
I just tried it with:
`kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.2.0-rc.0"`
`helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0`
to upgrade KubeRay and saw the same issue. It looks like replacing the CRDs first does not fix the problem.
I wonder if the hash mismatch is due to https://github.com/ray-project/kuberay/pull/2144, where we reorganized fields in RayCluster to adhere to golangci-lint.
I think I've verified that the hash mismatch is due to the golangci-lint changes: reverting the RayCluster spec to the same fields/order as v1.1.1 and redeploying the operator with that image and CRDs does not cause the RayService to restart. As suggested offline, a potential fix is to check for a v1.2.0 annotation that's set when the operator is upgraded, and to generate a new hash for `activeRayCluster.ObjectMeta.Annotations[utils.HashWithoutReplicasAndWorkersToDeleteKey]` when it's found. This would fix the issue in `shouldPrepareNewRayCluster`, where a new RayCluster is created because the v1.2.0-generated hash differs even though nothing in the RayCluster has actually changed. cc: @kevin85421
Thanks, @ryanaoleary! Great find!
As we discussed this morning, there are two solutions.
- Solution 1: Sort the fields and then generate the hash for the `utils.HashWithoutReplicasAndWorkersToDeleteKey` annotation. My only concern is whether this method will still work when we add a new field to the CRD.
- Solution 2: Add a new annotation or label `ray.io/kuberay-version`. If the value of `ray.io/kuberay-version` doesn't exist or differs from the version of the KubeRay operator pod, update the value of `utils.HashWithoutReplicasAndWorkersToDeleteKey` first and then add or update the value of `ray.io/kuberay-version` before doing a zero-downtime upgrade (a rough sketch follows below).

@andrewsykim @ryanaoleary, would you mind sharing your thoughts on these two solutions? Thanks!
Also cc @MortalHappiness because this issue is related to the golangci-lint changes: https://github.com/ray-project/kuberay/issues/2315#issuecomment-2299842715
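For concreteness, a rough sketch of what the Solution 2 check could look like. The annotation key strings and the `reconcileHashOnUpgrade` helper are placeholders for illustration, not the actual KubeRay reconciliation code (the real hash annotation is the `utils.HashWithoutReplicasAndWorkersToDeleteKey` constant):

```go
package main

import "fmt"

// Illustrative placeholder keys; ray.io/kuberay-version is the annotation
// proposed in Solution 2.
const (
	hashKey           = "ray.io/hash-without-replicas-and-workers-to-delete"
	kubeRayVersionKey = "ray.io/kuberay-version"
)

// reconcileHashOnUpgrade: if the active RayCluster was created by a different
// operator version, regenerate the stored hash with the current operator's
// serialization before any comparison, then stamp the current version. A pure
// field reordering between releases then never looks like a spec change.
func reconcileHashOnUpgrade(annotations map[string]string, operatorVersion string, computeHash func() string) {
	if annotations[kubeRayVersionKey] != operatorVersion {
		annotations[hashKey] = computeHash()
		annotations[kubeRayVersionKey] = operatorVersion
	}
}

func main() {
	// A cluster created by v1.1.1 has a hash but no version annotation yet.
	ann := map[string]string{hashKey: "hash-computed-by-v1.1.1"}
	reconcileHashOnUpgrade(ann, "v1.2.0", func() string { return "hash-computed-by-v1.2.0" })
	fmt.Println(ann[hashKey], ann[kubeRayVersionKey]) // hash-computed-by-v1.2.0 v1.2.0
}
```

A version-gated check like this also gives a natural hook for any future per-upgrade migrations of other fields or annotations.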
I think Solution 2 makes the most sense to me. For Solution 1, we'd ensure that newly generated hashes are consistent regardless of the order of fields in the RayCluster spec, but for existing RayClusters the value of `utils.HashWithoutReplicasAndWorkersToDeleteKey` would not match, and the RayService would still have to restart when upgrading KubeRay versions.
SGTM
@kevin85421 Solution 1 will not work if we add a new field to the CRD, regardless of whether we sort the fields in the CRD. Under the hood, it first `json.Marshal`s the object and then hashes it with `sha1`, so an additional field will definitely result in a different hash whether the fields are sorted or not. Furthermore, if we add a new field to the CRD, shouldn't the version of the CRD change? We can handle different versions of the CR separately, so this may not be an issue.
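For concreteness, a toy demo (stand-in structs, not the actual RayCluster types) of why this hashing scheme is sensitive to field order:

```go
package main

import (
	"crypto/sha1"
	"encoding/json"
	"fmt"
)

// Toy stand-ins for the spec before and after the field reorganization:
// identical fields and values, different declaration order.
type specV1 struct {
	Image    string `json:"image"`
	Replicas int32  `json:"replicas"`
}

type specV2 struct {
	Replicas int32  `json:"replicas"`
	Image    string `json:"image"`
}

func hash(v any) string {
	// encoding/json emits struct fields in declaration order, so reordering
	// fields changes the marshaled bytes and therefore the SHA-1 digest.
	b, _ := json.Marshal(v)
	return fmt.Sprintf("%x", sha1.Sum(b))
}

func main() {
	fmt.Println(hash(specV1{Image: "ray:2.9.0", Replicas: 2}))
	fmt.Println(hash(specV2{Image: "ray:2.9.0", Replicas: 2})) // different digest, same data
}
```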
Solution 2 makes a lot of sense and sounds good to me. It gives us flexibility when upgrading the KubeRay operator, because in the future we can update other fields or annotations during the upgrade too. My only concern is that we need to update annotations for all existing CRs; I don't know whether that will cause a performance issue. But it is a must for v1.2.0. For future versions, if there are performance issues, maybe we can update `ray.io/kuberay-version` only when it is inconsistent with the current version of the KubeRay operator and some annotations or fields are not compatible with the current operator version.
> My only concern is that we need to update annotations for all existing CRs; I don't know whether that will cause a performance issue.

I think it is fine.
> Solution 1 will not work if we add a new field to the CRD, regardless of whether we sort the fields in the CRD.
Worth noting that if additional fields are added as pointers with a default value of nil, they will not change the hash on upgrade.
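That holds as long as the new pointer field is tagged with `omitempty` (as optional CRD fields typically are), since a nil pointer is then omitted from the marshaled JSON entirely. A toy check:

```go
package main

import (
	"crypto/sha1"
	"encoding/json"
	"fmt"
)

type specOld struct {
	Image string `json:"image"`
}

// specNew adds an optional field as a pointer with omitempty: when nil, the
// field is dropped from the JSON entirely, so existing objects hash the same.
type specNew struct {
	Image  string `json:"image"`
	NewOpt *bool  `json:"newOpt,omitempty"`
}

func hash(v any) string {
	b, _ := json.Marshal(v)
	return fmt.Sprintf("%x", sha1.Sum(b))
}

func main() {
	fmt.Println(hash(specOld{Image: "ray:2.9.0"}) == hash(specNew{Image: "ray:2.9.0"})) // true
}
```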
KubeRay Component
ray-operator
What happened + What you expected to happen
Upgrading the kuberay-operator version from v1.1.1 to the v1.2.0-rc.0 release candidate causes a RayService to restart, reconciling the RayCluster and creating a new set of head and worker Pods. IIUC, the expected behavior when installing a newer version of the Ray operator is that running workloads are unaffected; however, I'm observing the RayService restart and the deployment being interrupted.
Relevant logs from the kuberay-operator after upgrading the image to v1.2.0-rc.0:
Reproduction script
Steps to reproduce:
1. Install a previous version of KubeRay
2. Create a RayService
3. Verify the RayService deploys successfully and is running
4. `kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.2.0-rc.0"`
5. `helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.0-rc.0`
6. `kubectl get rayservices`
7. RayService status is `restarting` and the RayCluster is recreated