ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Ray Cluster][Autoscaler] Maximize concurrent requests to bring up 1-> many instances #28546

Closed jiaodong closed 1 year ago

jiaodong commented 1 year ago

Description

From a user call (they train on TPUs on GCP): bringing up 0 -> 1 instances (the head node) took the expected time, but 1 -> many appears more linear than parallelized. Ideally we should issue parallelized requests to these instances to speed up full static cluster startup time.

Use case

No response

zygi commented 1 year ago

Hi! I don't think this is strictly a feature request, but I'll just describe our workflow and the problems we're facing, and maybe you guys can figure out a way to improve this.

So, we want to spin up GCP TPU pod slices using the autoscaler, so we wrote a custom autoscaler node provider for this. The way TPU pod slices work is that you create a single logical GCP instance, but that instance gives you e.g. 16 VMs. We want each of those 16 VMs to be a Ray node.

This means that the spin-up/down procedure becomes order-dependent. Specifically, when the provider tries to create the 1st Ray node, it needs to tell GCP to create the whole TPU pod slice instance. Then, after that is done, creating the 2nd-16th nodes just requires (in our implementation) updating the metadata of the pod slice instance to say that the shard is occupied.

Which is fine, and it works in principle, but it makes spawning these nodes order-dependent, so autoscaling must be done sequentially: you can't start spawning Ray nodes 1 and 2 at the same time, because node 2 needs the TPU pod slice instance to already exist.

We fixed this by setting the provider.foreground_node_launch = true parameter in the cluster config, but ideally we'd like finer control over this. Clearly only the 1st node needs to be spun up sequentially; the other 15 can then be done in parallel.
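
To make the order dependence concrete, our provider's create_node logic looks roughly like the sketch below; self._gcp and its methods are made-up placeholders for the GCP calls, and only the NodeProvider interface itself comes from Ray:

```python
from ray.autoscaler.node_provider import NodeProvider


class TPUPodSliceNodeProvider(NodeProvider):
    def create_node(self, node_config, tags, count):
        for _ in range(count):
            if not self._gcp.pod_slice_exists():
                # 1st Ray node: create the single logical TPU pod slice
                # instance, which brings up all 16 VMs at once (slow).
                self._gcp.create_pod_slice(node_config)
                self._gcp.claim_shard(index=0, tags=tags)
            else:
                # 2nd-16th Ray nodes: the VMs already exist, so just mark a
                # free shard as occupied in the instance metadata (fast).
                self._gcp.claim_shard(
                    index=self._gcp.next_free_shard(), tags=tags
                )
```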

Again, I'm not sure how Ray itself can best accommodate this since it's related to a particular implementation of a particular cloud provider. But maybe you'll have some ideas.

jiaodong commented 1 year ago

cc: @DmitriGekhtman, this is one of the large model training users we're engaging with who might come visit us someday. Do you mind taking a closer look at this so we can discuss the actionable items needed?

DmitriGekhtman commented 1 year ago

@zygi I've followed a similar strategy for provisioning Ray on TPU pods in this "hackathon" PR.

I agree that it is non-trivial to figure out how to get the autoscaler to provision TPU pods in a natural, hack-free way. The autoscaler does not have a concept of multi-Ray-node instances.

Here's a repo with scripts showing how to set up Ray on TPUs manually: https://github.com/JiahaoYao/ray-jax-tpu-pod-demos.

Do you see a fundamental benefit to using the autoscaler to provision Ray on TPU pods? Or would a well-documented manual approach describing how to provision a statically-sized cluster with TPU pods work?
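
(If the manual route is enough: once ray start has been run on each TPU VM as in those scripts, a quick driver-side sanity check that every host joined could look like the following, using only core Ray APIs.)

```python
import ray

# Attach to the manually provisioned, statically sized cluster
# (run from a node in the cluster, or swap "auto" for the head node address).
ray.init(address="auto")

# Each TPU VM in the pod slice should show up as one alive Ray node.
alive_nodes = [n for n in ray.nodes() if n["Alive"]]
print(f"{len(alive_nodes)} Ray nodes registered")
```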

jiaodong commented 1 year ago

In addition to what @DmitriGekhtman pointed out, my understanding so far is that LLM users don't actually need the autoscaler, but rather a statically sized cluster that works well and provides the abstractions needed to facilitate their placement group / model parallelism policies.

@zygi feel free to correct us if our understanding is inaccurate. We're actively working with https://github.com/alpa-projects/alpa/blob/main/alpa/device_mesh.py, which provides a static / flexible definition of devices in a Ray cluster for HPC / LLM workloads, and I'm interested to learn whether that aligns better with what you're looking for.
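
To make the "static cluster + placement groups" picture concrete, here is a minimal sketch; the "TPU" custom resource and the 16-bundle count are assumptions about how the TPU VMs would be labeled when they join the cluster, not something Ray provides out of the box:

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Reserve one bundle per TPU VM, each on a distinct host; "TPU" is a custom
# resource the TPU VMs are assumed to advertise when they run ray start.
pg = placement_group([{"CPU": 1, "TPU": 1}] * 16, strategy="STRICT_SPREAD")
ray.get(pg.ready())

# Model-parallel actors/tasks can then be scheduled into these bundles.
```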

zygi commented 1 year ago

Hmm, but I think we are in fact interested in the autoscaler! TPU nodes are expensive, so we'd prefer for them to be down when we're not running anything, but it would also be extremely convenient to be able to just run ray job submit -- python train_tpu_model.py and have the TPUs automatically provisioned and deprovisioned.
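
For concreteness, the Python SDK equivalent of that CLI call is roughly the following (the head node address is a placeholder):

```python
from ray.job_submission import JobSubmissionClient

# Point at the head node's dashboard / job server.
client = JobSubmissionClient("http://<head-node-ip>:8265")

# Equivalent of `ray job submit -- python train_tpu_model.py`; ideally the
# autoscaler would bring the TPU slice up for this job and scale it back
# down once the job finishes.
job_id = client.submit_job(
    entrypoint="python train_tpu_model.py",
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))
```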

jiaodong commented 1 year ago

@zygi IIUC you're looking for job-level Ray cluster lifecycle management? Would it be sufficient to have ray job submit -- python train_tpu_model.py spawn an ephemeral Ray cluster on demand and shut it down when your script reaches a termination state?

Or -- is dynamically resizing the same TPU cluster (rather than keeping it at a static size) actually a valuable feature you want to use and want to see improvements on?

zygi commented 1 year ago

Would it be sufficient to have ray job submit -- python train_tpu_model.py spawn an ephemeral Ray cluster on demand and shut it down when your script reaches a termination state?

For production, yes, but I imagine that would be hell for development -- waiting 3-5 minutes to see if your code runs would be bad, especially for a language like Python. So I'd be against ephemeral clusters, whether they're static or dynamic.

Now, in principle I'd be quite happy with a non-ephemeral static cluster -- even disregarding autoscaling, one of Ray's advantages should be to enable rapid prototyping, without needing to set the cluster up again between command executions. However, AFAIK none of the problems we encountered were caused by dynamic autoscaling -- the problems would remain if the TPU configuration were static.

Or -- is dynamically resizing the same TPU cluster (rather than keeping it at a static size) actually a valuable feature you want to use and want to see improvements on?

Kind of. To clarify, we don't want to (and in fact, can't) resize individual TPU slices, but we do want to change how many instances of those TPU slices we run. Imagine a "ray tune"-like workflow -- we define a hyperparameter grid and want to run training over it. Ideally, I'd want to just define the train function and tell Ray to run it over all the hyperparameter combos -- and Ray would just do it, spawning additional TPU slice instances if that would be helpful.
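
Roughly what I have in mind, sketched here with plain Ray tasks rather than Ray Tune itself; the "tpu-slice" custom resource (one unit per slice instance) is an assumption about how the slices would advertise themselves:

```python
import itertools
import ray

ray.init(address="auto")

# One "tpu-slice" unit per trial: with autoscaling, queued trials beyond the
# current capacity would (ideally) trigger additional slice instances.
@ray.remote(resources={"tpu-slice": 1})
def train(lr, batch_size):
    # ... the actual JAX/TPU training loop would go here ...
    return {"lr": lr, "batch_size": batch_size, "loss": 0.0}  # placeholder

grid = itertools.product([1e-4, 3e-4, 1e-3], [64, 128])
results = ray.get([train.remote(lr, bs) for lr, bs in grid])
```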

jiaodong commented 1 year ago

Thanks for the info @zygi! This is very clear now, and I completely agree with the dev UX you described. We have productionized this experience, called "workspaces", in Robert's demo from the recent Ray Summit keynote; the gist of it is interactive development on a running cluster.

We don't have a 100% matching experience in OSS, but there are some pieces that capture the essence of it:

- Interactive Ray development with Ray Client: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
- Seamlessly sync working_dir across a cluster: https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#id1
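
Put together, the interactive loop with those two pieces looks roughly like this (the Ray Client address is a placeholder for your head node):

```python
import ray

# Connect from a laptop via Ray Client; runtime_env syncs the local
# working directory to every node of the long-running cluster.
ray.init(
    "ray://<head-node-ip>:10001",
    runtime_env={"working_dir": "."},
)

@ray.remote
def ping():
    return "cluster reachable, working_dir synced"

print(ray.get(ping.remote()))
```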

Sharp edges to be aware of:

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 1 year ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!