wilsonwang371 closed this issue 8 months ago.
This would be great to have. Let's figure out a design... Would we be aiming for something similar to rollouts for Deployments?
Note: we need to review #231 and make a new design.
Would we be aiming for something similar to rollouts for Deployments?
@DmitriGekhtman I think we can separate the update for the two roles (head and worker) and reuse the standard Deployment knobs:
MaxUnavailable
or MaxSurge
But in Ray we can have only a head node with no workers, so we need to define some new behaviors here. Wilson and I will come up with a detailed design, and we may have several rounds of discussion.
I like the strategy of splitting the discussion (and potentially even implementation) into updates for head and updates for worker.
cc @brucez-anyscale for the head node HA aspect. Stating the question again: What should happen when you change the configuration for a RayCluster's head pod?
Wilson and I will come up with a detailed design, and we may have several rounds of discussion.
That's great! I'm looking forward to discussing the design of this functionality -- I think it's very important.
Right now, RayService does whole-cluster upgrading, so RayService handles this itself for now. For RayCluster rolling upgrade: I think the head node and worker nodes should be backward compatible, so they can rejoin the Ray cluster.
@wilsonwang371 Here I think we need to find the exact use cases where users can benefit from this feature.
First is the user behavior. Following the previous discussion, we can make the assumption that in this story:
In all of those cases, we need a mechanism to ensure that the Ray packages in the images are compatible.
Here are some scenarios that I can think about:
1. Upgrading a raycluster: in this case, we would not need the feature, since the recreate strategy would be enough; the only modification is to enable the worker upgrade in the reconcile.
2. Upgrading a raycluster with some remaining actors inside the old one: here the situation is a little bit tricky, since we need to support mechanisms in Ray that migrate actors from the old raycluster to the new one.
3. Upgrading a raycluster managed by a rayservice: just as @brucez-anyscale said, the whole cluster would upgrade. This case is the most likely to need a rolling upgrade feature. Since for now we recreate a brand-new raycluster via the rayservice controller, we could support rolling upgrade in the raycluster controller to ease the RayService upgrade.
Indeed we need to support standard update semantics for raycluster, at least in the recreate strategy. However, for now, considering those cases, would a raycluster rolling upgrade feature bring any further significant benefit to the user? WDYT @DmitriGekhtman
Let's first consider the most basic use-case that we were going for with the --forced-cluster-upgrade flag.
When a user updates a RayCluster CR and applies it, they expect changes to pod configs to be reflected in the actual pod configuration, even if the change is potentially disruptive to the Ray workload. If you update a workerGroupSpec, workers with outdated configuration should be eliminated and workers with updated configuration should be created. Same thing for the headGroupSpec.
The ability to do (destructive) updates is available with the Ray Autoscaler's VM node providers and with the legacy python-based Ray operator. The implementation for this uses hashes of last-applied node configuration. We could potentially do the same thing here.
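The hash-based approach described above could be sketched roughly like this (a minimal illustration, assuming the pod config is available as a plain dict; `config_hash` and `needs_update` are hypothetical names, not actual KubeRay APIs):

```python
import hashlib
import json

def config_hash(pod_config: dict) -> str:
    # Serialize with sorted keys so logically-equal configs hash identically.
    serialized = json.dumps(pod_config, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

def needs_update(desired_config: dict, last_applied_hash: str) -> bool:
    # A pod is outdated when its desired config no longer matches the
    # hash recorded (e.g. as an annotation) when the pod was created.
    return config_hash(desired_config) != last_applied_hash
```

The controller would store the hash of the applied spec on each pod (e.g. as an annotation) and delete and recreate any pod whose stored hash differs from the hash of the current CR spec.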
If Ray versions mismatch, things won't work out, no matter what, because Ray does not have cross-version compatibility. If workloads are running, they may be interrupted. These are complex, higher-order concerns, but we can start by just registering pod config updates.
Question: why create pods directly and not a Deployment, which would handle this? (Side note: I'm not familiar with Ray in particular, just an operator of a Kubernetes cluster where Ray is deployed.)
I am curious if there has been any update on this feature or do we have any plans?
If we are worried that we do not have a strong use case to focus on, I can help: not having rolling upgrades is a real pain for us. I am speaking from the perspective of an ML platform that supports all ML teams within a company.
I am happy to discuss more on this, or help any way I can.
@jhasm I don't want to speak for others, but I believe Serve will be critical to ensuring 100% uptime during upgrades of Ray cluster versions. The way a model is served shouldn't hinder the upgrade, i.e., Serve CLI, SDK, etc. I had some thoughts I wanted to share.
There may be opportunities to enable cluster version rolling upgrades using Ray's GCS external Redis.
A potential starting point may be to detect when the Ray cluster version changes. If the version changes and the cluster name is currently deployed, then launch a new Ray cluster. Once jobs are transferred, have KubeRay rewrite the service to point to the new cluster. I believe the more complex portion is transferring the jobs and actors to the new cluster.
Keep the head service and serve service with the same name.
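A rough sketch of that flow, where every `k8s.*` helper is a hypothetical placeholder rather than a real KubeRay or Kubernetes client API:

```python
def reconcile_cluster_version(desired: dict, live: dict, k8s) -> dict:
    """If the Ray version in the CR changed, launch a replacement cluster,
    then repoint the stable head service at it once workloads are moved.
    All k8s.* calls are illustrative placeholders."""
    if desired["rayVersion"] == live["rayVersion"]:
        return live  # nothing to do

    new_cluster = k8s.create_cluster(desired)   # launch replacement first
    k8s.wait_until_ready(new_cluster)           # keep capacity until ready
    k8s.transfer_jobs(live, new_cluster)        # the hard part: job/actor migration
    k8s.patch_service_selector("head-svc", new_cluster["name"])  # repoint traffic
    k8s.delete_cluster(live)                    # only then tear down the old one
    return new_cluster
```

Keeping the service names stable (as suggested above) is what lets the selector patch be the only traffic change clients ever observe.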
Any update on this? Lack of rolling updates is a no-go for many production serving workloads.
The RayService custom resource is intended to support the upgrade semantics of the sort people in this thread are looking for.
An individual Ray cluster should be thought of as a massive pod: there is not a coherent way to conduct a rolling upgrade of a single Ray cluster (though some large enterprises have managed to achieve this).
tl;dr solutions for upgrades require multiple Ray clusters
In my experience, doing anything "production-grade" with Ray requires multiple Ray clusters and external orchestration.
@qizzzh, I just saw your message. As @DmitriGekhtman mentioned, upgrading Ray involves more than one RayCluster. For RayService, we plan to support incremental upgrades, meaning that we won't need a new, large RayCluster for a zero-downtime upgrade. Instead, we will gradually increase the size of the new RayCluster and decrease the size of the old one. If you want to chat more, feel free to reach out to me on Slack.
Ray doesn't natively support rolling upgrades, so it is impossible for KubeRay to achieve that within a single RayCluster. This issue should move to Ray instead of KubeRay. Closing this issue; I will open new issues to track incremental upgrade when I start working on it.
Hi @kevin85421, is there any progress on this, or any tracking issue created, so we can check whether the incremental upgrade effort has started or not? Thanks a lot!
@zzb54321 there have been some discussions but no work started. I am willing to start a one-pager proposal on this effort. @kevin85421 any objections?
@zzb54321 instead of an incremental upgrade, the community recently prefers an N+1 upgrade for now. See https://github.com/ray-project/kuberay/issues/2274 for more details.
I am willing to start a one-pager proposal on this effort.
sounds good!
@kevin85421 what's an N+1 upgrade?
what's an N+1 upgrade?
RayService manages multiple (N) small RayCluster CRs simultaneously. When we need to upgrade the RayService CR, it creates a new small RayCluster CR and then tears down an old RayCluster CR.
You can think of it like a K8s Deployment, where each Pod in the Deployment is a 1-node RayCluster. Then, use the K8s rolling upgrade mechanism to upgrade the K8s Deployment.
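The rollout described above can be sketched as a simple replace-one-at-a-time loop (illustrative only; `create`, `wait_ready`, and `delete` stand in for controller actions, not real KubeRay APIs):

```python
def n_plus_one_upgrade(clusters: list, new_spec, create, wait_ready, delete) -> list:
    """Replace each old small cluster one at a time: create the new
    cluster first, wait for it to serve, then tear down the old one,
    so total capacity never drops below N clusters."""
    for old in list(clusters):
        new = create(new_spec)   # surge: launch the replacement first
        wait_ready(new)          # don't reduce capacity until it's serving
        delete(old)              # then tear down the outdated cluster
        clusters.remove(old)
        clusters.append(new)
    return clusters
```

This mirrors a Deployment rollout with maxSurge=1, maxUnavailable=0, applied to whole RayCluster CRs instead of pods.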
Gotcha! That makes a lot of sense. I'll follow https://github.com/ray-project/kuberay/issues/2274
Do you think it would be possible to set e.g. an environment variable to be different in each small cluster automatically? We've been thinking about sharding our current one large cluster into multiple smaller clusters to handle increasing scale (probably roughly what is being referred to here) - it would be nice if we could do that via this mechanism so that we didn't have to manage that ourselves!
@JoshKarpel Would you mind explaining why you need to have different environment variables for different small RayCluster CRs? For the short term, I plan to make RayService more similar to a K8s Deployment (where each Pod has the same spec) instead of a K8s StatefulSet. That is, I prefer to make all RayCluster CRs that belong to the same RayService CR have the same spec. If we make it stateful, I think the complexity will increase a lot.
Oh, sorry, yes, I should have said why!
Our goal here would be to shard a set of dynamically-created Serve applications (reconciled with our ML model store) across multiple clusters. Right now, we deploy the Serve applications from inside the cluster itself, so each cluster would need to know which shard it should be (e.g., to then do consistent hashing on the metadata that defines the Serve apps, so it knows which apps to create in itself).
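For the consistent-hashing step, rendezvous (highest-random-weight) hashing is one simple option; this is a sketch under that assumption, not necessarily what the setup above uses. Each cluster would read its own shard id (e.g. from an env var) and deploy only the apps that map to it:

```python
import hashlib

def rendezvous_shard(app_name: str, shards: list) -> str:
    # Each (shard, app) pair gets a score; the app is owned by the
    # highest-scoring shard. Removing a shard only reassigns the apps
    # that shard owned, which is the consistent-hashing property we want.
    def score(shard: str) -> str:
        return hashlib.sha256(f"{shard}:{app_name}".encode()).hexdigest()
    return max(shards, key=score)
```

A cluster whose own id is `my_shard` would then create a Serve app only when `rendezvous_shard(app, shards) == my_shard`.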
We don't deploy the apps through the RayService CR because we don't want KubeRay to consider them when determining the health of the cluster (see https://github.com/ray-project/ray/issues/44226).
That said - short term, your plan totally makes sense, and I agree that it will be much simpler! Once we have that maybe we can work on extending it to add some stateful-ness. By then maybe we'll have played with it in our setup and have something we could upstream.
A few questions about the N+1 upgrade. Assume the RayService CR defines multiple applications. In the context of an N+1 upgrade, there will be multiple small RayCluster CRs.
Does that mean the applications will be sharded and distributed across these small clusters? In terms of sharding, can a single application be sharded across multiple clusters, or does an application have to fit into one cluster?
This has different implications. In the latter case, it's actually still a blue/green upgrade from an application's point of view: if a RayService CR has one giant application, the upgrade still needs almost double the resources.
If only one application is updated, then when a certain small cluster is upgraded, the other applications co-located in the same cluster will be affected and also updated. So basically we can't update one specific application without touching the others. Is that right?
Thanks.
@zzb54321 I think this plan would make one application fit into one cluster, so it is not a blue/green upgrade. By analogy with a K8s Deployment, it is like a rolling upgrade that gradually upgrades all instances. Applications and Ray clusters have a one-to-one correspondence in my opinion, so no other applications co-located in the same cluster would be affected.
Search before asking
Description
Right now we don't support Ray cluster rolling upgrades. This is a valid requirement for customers that have a large number of nodes in their Ray cluster deployments.
Use case
Support rolling upgrades of Ray clusters, which would benefit users with large Ray clusters.
Related issues
No response
Are you willing to submit a PR?