ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Feature] rolling upgrade design and implementation for Kuberay #527

Closed wilsonwang371 closed 8 months ago

wilsonwang371 commented 2 years ago

Search before asking

Description

Right now we don't support Ray cluster rolling upgrades. This is a valid requirement for customers that have a large number of nodes in their Ray cluster deployments.

Use case

Support rolling upgrades of Ray clusters, which would be beneficial to users with large Ray clusters.

Related issues

No response

Are you willing to submit a PR?

DmitriGekhtman commented 2 years ago

This would be great to have. Let's figure out a design... Would we be aiming for something similar to rollouts for Deployments?

wilsonwang371 commented 2 years ago

Note: we need to review #231 and make a new design.

scarlet25151 commented 2 years ago

Would we be aiming for something similar to rollouts for Deployments?

@DmitriGekhtman I think we can separate the update into two roles:

  1. For the head node, we can just delete the old one and bring up a new one; here we need to consider how this interacts with the HA mechanism.
  2. For worker nodes, yes, we can use rolling-update logic similar to Deployments. However, there are some differences: a Deployment does not support scaling old-version replicas to 0 (it keeps MaxUnavailable or MaxSurge constraints), but in Ray we can have only a head node with no workers, so we need to define some new behaviors (a rough sketch follows below).
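
This is not KubeRay's API; just a minimal sketch, with hypothetical names, of what a MaxUnavailable-style replacement budget for worker pods could look like in a single reconcile pass:

```python
# Hypothetical sketch of per-reconcile worker replacement with a
# MaxUnavailable-style budget. WorkerPod, spec_hash, etc. are illustrative,
# not the KubeRay API.
from dataclasses import dataclass


@dataclass
class WorkerPod:
    name: str
    spec_hash: str  # hash of the pod config the pod was created from
    ready: bool


def pods_to_replace(pods: list[WorkerPod], desired_hash: str, max_unavailable: int) -> list[WorkerPod]:
    """Pick at most `max_unavailable` outdated-but-ready workers to recreate this pass."""
    outdated = [p for p in pods if p.spec_hash != desired_hash]
    already_down = sum(1 for p in pods if not p.ready)
    budget = max(0, max_unavailable - already_down)
    # Only disrupt ready pods; unready ones already count as unavailable.
    return [p for p in outdated if p.ready][:budget]


# Example: with max_unavailable=1, only one outdated worker is replaced per pass.
pods = [
    WorkerPod("w-0", "old", ready=True),
    WorkerPod("w-1", "old", ready=True),
    WorkerPod("w-2", "new", ready=True),
]
print([p.name for p in pods_to_replace(pods, "new", max_unavailable=1)])  # ['w-0']
```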

Wilson and I will come up with a detailed design, and we may have several rounds of discussion.

DmitriGekhtman commented 2 years ago

I like the strategy of splitting the discussion (and potentially even the implementation) into updates for the head and updates for the workers.

cc @brucez-anyscale for the head node HA aspect. Stating the question again: What should happen when you change the configuration for a RayCluster's head pod?

Wilson and I will come up with a detailed design, and we may have several rounds of discussion.

That's great! I'm looking forward to discussing the design of this functionality -- I think it's very important.

brucez-anyscale commented 2 years ago

Right now, RayService does whole-cluster-level upgrading, so RayService works on its own for now. As for RayCluster rolling upgrades: I think the head node and worker nodes should be backward compatible, so they can rejoin the Ray cluster.

scarlet25151 commented 2 years ago

@wilsonwang371 Here I think we need to find the exact use cases where users can benefit from this feature.

First is the user behavior. Following the previous discussion, we can assume that in this story:

  1. the users would like to upgrade the image of the workers.
  2. the users would like to upgrade the image of the head.
  3. the users would like to upgrade both the head and the workers.

In all of these cases, we need to ensure that the Ray packages in the images are compatible.

Here are some scenarios that I can think of:

  1. There is no actor or task running on the RayCluster.

In this case, we would not need the feature, since the recreate strategy would be enough; the only modification is to enable the worker upgrade in the reconcile loop.

  2. There are some jobs running on the RayCluster, with some actors remaining inside the old one.

Here the situation is a little bit tricky, since we need to support mechanisms in Ray that migrate actors from the old RayCluster to the new one.

  3. There is a Ray service running on the RayCluster; just as @brucez-anyscale said, the whole cluster would be upgraded.

This case is the most likely to benefit from a rolling upgrade feature. Since for now we recreate a brand-new RayCluster via the RayService controller, we could support rolling upgrades in the RayCluster controller to ease the RayService upgrade.

Indeed, we need to support standard update semantics for RayCluster, at least with a recreate strategy. However, for now, considering those cases, would a RayCluster rolling upgrade feature bring any further significant benefit to users? WDYT @DmitriGekhtman

DmitriGekhtman commented 2 years ago

Let's first consider the most basic use-case that we were going for with the --forced-cluster-upgrade flag.

When a user updates a RayCluster CR and applies it, they expect changes to pod configs to be reflected in the actual pod configuration, even if the change is potentially disruptive to the Ray workload. If you update a workerGroupSpec, workers with outdated configuration should be eliminated and workers with updated configuration should be created. Same thing for the HeadGroupSpec.

The ability to do (destructive) updates is available with the Ray Autoscaler's VM node providers and with the legacy python-based Ray operator. The implementation for this uses hashes of last-applied node configuration. We could potentially do the same thing here.
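
A minimal sketch of that last-applied-config hashing idea, assuming an illustrative spec layout rather than KubeRay's actual schema:

```python
# Sketch of the "hash the last-applied pod config" approach mentioned above.
# The spec dictionaries are illustrative, not the real CRD fields.
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Stable hash of a pod/group spec (sorted keys keep it order-insensitive)."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()


old_spec = {"groupName": "workers", "image": "rayproject/ray:2.9.0", "cpu": 4}
new_spec = {"groupName": "workers", "image": "rayproject/ray:2.10.0", "cpu": 4}

# The controller would store spec_hash(...) as an annotation on each pod and
# recreate any pod whose annotation no longer matches the current spec's hash.
print(spec_hash(old_spec) != spec_hash(new_spec))  # True -> pod is outdated
```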

If Ray versions mismatch, things won't work out, no matter what, because Ray does not have cross-version compatibility. If workloads are running, they may be interrupted. These are complex, higher-order concerns, but we can start by just registering pod config updates.

grzesuav commented 1 year ago

Question: why create pods directly and not a Deployment, which would handle this? (Side note: I'm not familiar with Ray in particular, just an operator of a Kubernetes cluster where Ray is deployed.)

jhasm commented 1 year ago

I am curious whether there has been any update on this feature, or whether there are any plans.

If we are worried that we do not have a strong use case to focus on, I can help. Not having rolling upgrades is a real pain for us. I am speaking from the perspective of an ML platform that supports all ML teams within a company.

  1. We plan to have several Ray clusters, both standing and ephemeral. Think one Ray cluster each for model dev (interactive), automated training, batch serving, and real-time serving per group or project in an ML team.
  2. For standing clusters, not having rolling upgrades is like going back a few years in infrastructure for us. Every service we have supports rolling upgrades, and we do not allow downtime in production services.
  3. For real-time serving (Ray Serve), this is a blocker. Serving needs to be available 24x7; there is no acceptable downtime outside the SLA.
  4. Since project-specific Python dependencies are baked into the image running on the workers, we will need to update the image for every change. This happens frequently for us, and taking scheduled downtime to do it is out of the norm for our infrastructure.
  5. Since KubeRay is at v0.5.0, we expect to keep up with the rapid version upgrades, and this will require us to delete and recreate all our Ray clusters.
  6. Deleting and recreating a resource is not a standard CI/CD operation for us; it requires custom steps or manual support in our case. Deleting a resource manually is reserved for emergencies, but the lack of rolling upgrades forces us to do it frequently.

I am happy to discuss more on this, or help any way I can.

peterghaddad commented 1 year ago

@jhasm I don't want to speak for others, but I believe Serve will be critical to ensuring 100% uptime during upgrades of Ray cluster versions. The way a model is served (i.e., Serve CLI, SDK, etc.) shouldn't hinder the upgrade. I had some thoughts I wanted to share.

There may be opportunities to enable cluster version rolling upgrades using Ray's GCS external Redis.

A potential starting point may be to detect when the RayCluster version changes. If the version changes and the cluster name is currently deployed, then launch a new Ray cluster. Once jobs are transferred, have KubeRay rewrite the service to point to the new cluster. I believe the more complex portion is transferring the jobs and actors to the new cluster.
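
Reduced to the sequence of controller actions this would imply (the version fields and ordering here are assumptions, not an agreed design):

```python
# Sketch only: the flow described above, expressed as the actions a controller
# might take when the desired Ray version no longer matches the live cluster.
def upgrade_actions(desired_version: str, live_version: str) -> list[str]:
    if desired_version == live_version:
        return []  # nothing to do
    return [
        f"create a new RayCluster at {desired_version}",
        "wait for the new cluster to become ready",
        "transfer jobs and actors from the old cluster (the hard part noted above)",
        "rewrite the service to point at the new cluster's pods",
        f"tear down the old RayCluster at {live_version}",
    ]


print(upgrade_actions("2.10.0", "2.9.0"))
```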

kevin85421 commented 1 year ago

Good point: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1682018043124949?thread_ts=1681846159.725999&cid=C02GFQ82JPM

Keep the head service and serve service with the same name.

qizzzh commented 8 months ago

Any update on this? The lack of rolling updates is a no-go for many production serving workloads.

DmitriGekhtman commented 8 months ago

The RayService custom resource is intended to support the upgrade semantics of the sort people in this thread are looking for.

An individual Ray cluster should be thought of as a massive pod -- there is not a coherent way to conduct a rolling upgrade of a single Ray cluster (though some large enterprises have actually managed to achieve this).

tl;dr: solutions for upgrades require multiple Ray clusters.

In my experience, doing anything "production-grade" with Ray requires multiple Ray clusters and external orchestration.

kevin85421 commented 8 months ago

@qizzzh, I just saw your message. As @DmitriGekhtman mentioned, upgrading Ray involves more than one RayCluster. For RayService, we plan to support incremental upgrades, meaning that we won't need a new, large RayCluster for a zero-downtime upgrade. Instead, we will gradually increase the size of the new RayCluster and decrease the size of the old one. If you want to chat more, feel free to reach out to me on Slack.
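
A toy illustration of the capacity math behind that, with an assumed fixed step size (not a KubeRay parameter):

```python
# Shift workers step by step from the old RayCluster to the new one instead of
# standing up a full-size duplicate. The step size is an assumption.
def incremental_steps(total_workers: int, step_size: int) -> list[tuple[int, int]]:
    """Return (old_replicas, new_replicas) after each step; the sum stays constant."""
    steps = []
    new = 0
    while new < total_workers:
        new = min(total_workers, new + step_size)
        steps.append((total_workers - new, new))
    return steps


# With 10 workers and a step of 2: (8, 2), (6, 4), (4, 6), (2, 8), (0, 10) --
# peak extra capacity is one step's worth, not a whole duplicate cluster.
print(incremental_steps(10, 2))
```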

Ray doesn't natively support rolling upgrades, so it is impossible for KubeRay to achieve that within a single RayCluster. This issue should move to Ray instead of KubeRay. Closing this issue; I will open new issues to track incremental upgrades when I start working on them.

zzb54321 commented 2 months ago

Hi @kevin85421, is there any progress on this, or any tracking issue created, so we can check whether the incremental upgrade effort has started? Thanks a lot!

andrewsykim commented 2 months ago

@zzb54321 there have been some discussions, but no work has started. I am willing to start a one-pager proposal on this effort. @kevin85421 any objections?

kevin85421 commented 2 months ago

@zzb54321 Instead of an incremental upgrade, the community currently prefers an N+1 upgrade. See https://github.com/ray-project/kuberay/issues/2274 for more details.

kevin85421 commented 2 months ago

I am willing to start a one-pager proposal on this effort.

sounds good!

JoshKarpel commented 2 months ago

@kevin85421 what's an N+1 upgrade?

kevin85421 commented 2 months ago

what's an N+1 upgrade?

RayService manages multiple (N) small RayCluster CRs simultaneously. When we need to upgrade the RayService CR, it creates a new small RayCluster CR and then tears down an old RayCluster CR.

You can think of it like a K8s Deployment, where each Pod in the Deployment is a 1-node RayCluster. Then, use the K8s rolling upgrade mechanism to upgrade the K8s Deployment.
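
Not the planned implementation, just a toy sketch of that Deployment analogy with illustrative types:

```python
# Treat each small RayCluster like a Deployment replica and replace them one
# at a time: surge one new-version cluster, then retire one old-version cluster.
from dataclasses import dataclass


@dataclass
class SmallCluster:
    name: str
    version: str
    ready: bool = True


def rollout_step(clusters: list[SmallCluster], target_version: str) -> list[SmallCluster]:
    """One reconcile pass of the N+1 rollout."""
    outdated = [c for c in clusters if c.version != target_version]
    if not outdated:
        return clusters  # rollout complete
    surge = SmallCluster(f"{outdated[0].name}-next", target_version)
    # A real controller would wait for `surge` to become ready before tearing
    # down the old cluster; here we assume it is ready immediately.
    return [c for c in clusters if c is not outdated[0]] + [surge]


clusters = [SmallCluster("rc-0", "2.9.0"), SmallCluster("rc-1", "2.9.0")]
while any(c.version != "2.10.0" for c in clusters):
    clusters = rollout_step(clusters, "2.10.0")
print([(c.name, c.version) for c in clusters])
```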

JoshKarpel commented 2 months ago

Gotcha! That makes a lot of sense. I'll follow https://github.com/ray-project/kuberay/issues/2274

Do you think it would be possible to set, e.g., an environment variable differently in each small cluster automatically? We've been thinking about sharding our current single large cluster into multiple smaller clusters to handle increasing scale (probably roughly what is being referred to here); it would be nice if we could do that via this mechanism so that we didn't have to manage it ourselves!

kevin85421 commented 2 months ago

@JoshKarpel Would you mind explaining why you need to have different environment variables for different small RayCluster CRs? For the short term, I plan to make RayService more similar to a K8s Deployment (where each Pod has the same spec) instead of a K8s StatefulSet. That is, I prefer to make all RayCluster CRs that belong to the same RayService CR have the same spec. If we make it stateful, I think the complexity will increase a lot.

JoshKarpel commented 2 months ago

Oh, sorry, yes, I should have said why!

Our goal here would be to shard a set of dynamically created Serve applications (reconciled with our ML model store) across multiple clusters. Right now, we deploy the Serve applications from inside the cluster itself, so each cluster would need to know which shard it is (e.g., so it can do consistent hashing on the metadata that defines the Serve apps and know which apps to create in itself).
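
For illustration, a hash-based simplification of that shard assignment, assuming a hypothetical per-cluster environment variable (RAY_SHARD_INDEX is made up here, not something KubeRay sets today):

```python
# Each small cluster reads its shard index and only deploys the Serve apps
# that hash to that shard. The env var name and app list are assumptions.
import hashlib
import os


def apps_for_shard(app_names: list[str], shard_index: int, num_shards: int) -> list[str]:
    """Stable assignment of apps to shards by hashing the app name."""
    def shard_of(name: str) -> int:
        return int(hashlib.sha256(name.encode()).hexdigest(), 16) % num_shards

    return [name for name in app_names if shard_of(name) == shard_index]


all_apps = ["recommender", "ranker", "fraud-detector", "summarizer"]
shard = int(os.environ.get("RAY_SHARD_INDEX", "0"))  # set differently per small cluster
print(apps_for_shard(all_apps, shard, num_shards=2))
```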

We don't deploy the apps through the RayService CR because we don't want KubeRay to consider them when determining the health of the cluster (see https://github.com/ray-project/ray/issues/44226).

That said - short term, your plan totally makes sense, and I agree that it will be much simpler! Once we have that maybe we can work on extending it to add some stateful-ness. By then maybe we'll have played with it in our setup and have something we could upstream.

zzb54321 commented 2 months ago

A few questions about the N+1 upgrade. Assume the RayService CR defines multiple applications. In the context of an N+1 upgrade, there will be multiple small RayCluster CRs.

  1. Does it mean the applications will be sharded and distributed across these small clusters? In terms of sharding, will a given application be sharded across multiple clusters, or does an application have to fit into one cluster?
    This has different implications. In the latter case, it's actually still a blue/green upgrade from an application's point of view; if a RayService CR has one giant application, the upgrade still needs almost double the resources.

  2. If only one application is updated, then when a given small cluster is upgraded, the other applications co-located in that cluster will also be affected and updated. So basically we can't update one specific application without touching the others. Is that right?

Thanks.

Basasuya commented 2 months ago

@zzb54321 I think this plan would make one application fit into one cluster, so it is not a blue/green upgrade. By analogy with a K8s Deployment, it is like a rolling upgrade that gradually upgrades all instances. In my opinion, applications and Ray clusters are in one-to-one correspondence, so no other applications co-located in the same cluster would be affected.