ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[autoscaler] dynamically set `min_nodes` during interactive sessions #47248

Open akdienes opened 1 month ago

akdienes commented 1 month ago

Description

I would like a way to ask my cluster to keep a set of nodes alive without having to pass a placement group to all my jobs; ideally I could interactively spin up a bunch of nodes and let them sit until I close the connection. If I reserve a big placement group, I have to pass that placement group around to every job, otherwise all of those resources will be considered unavailable. And if I set `min_nodes` in the config, the nodes stay alive whether or not I am actually using the cluster at that moment.
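For context, the placement-group workaround being described looks something like the sketch below (bundle counts and the spread strategy are illustrative, not part of the original report):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# Reserve 4 one-CPU bundles; STRICT_SPREAD puts each bundle on a
# different node, so at least 4 nodes are kept from scaling down.
pg = placement_group([{"CPU": 1}] * 4, strategy="STRICT_SPREAD")
ray.get(pg.ready())  # block until the reservation is fulfilled

@ray.remote(num_cpus=1)
def work():
    return "done"

# The pain point: the reserved CPUs are only usable by tasks that are
# explicitly scheduled into the placement group.
ref = work.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
print(ray.get(ref))
```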

Use case

No response

0xinsanity commented 1 month ago

This would be extremely useful. Would love to know if this is possible.

anyscalesam commented 1 month ago

cc @kevin85421

kevin85421 commented 1 month ago

Do you use KubeRay? In KubeRay, you can directly edit the min / max replicas of worker groups.
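For reference, that edit is a field in the RayCluster custom resource, roughly as in the sketch below (names and counts are hypothetical); the KubeRay operator reconciles the change without recreating the cluster:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster          # hypothetical name
spec:
  workerGroupSpecs:
    - groupName: default-group   # hypothetical group
      replicas: 4
      minReplicas: 4             # raise this to keep 4 workers alive
      maxReplicas: 10
      # rayStartParams, template, etc. omitted
```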

akdienes commented 1 month ago

I do not use KubeRay, just the vanilla autoscaler.

kevin85421 commented 1 month ago

I would recommend using KubeRay instead if you are able to launch a K8s cluster.

akdienes commented 1 month ago

That sounds like quite a complicated change to make to my existing (working) setup. Also, I'm not sure how it solves the problem: if I'm understanding correctly, I'd still have to change `min_replicas` in a config file and redeploy, but that is already possible by setting `min_workers` in my cluster config YAML and restarting the head node.
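(For concreteness, the static approach being referred to is roughly the following cluster launcher excerpt; the node type name and counts are illustrative.)

```yaml
# Excerpt from a Ray cluster launcher config (values illustrative)
cluster_name: example
max_workers: 10
available_node_types:
  ray.worker.default:
    min_workers: 4    # keeps 4 workers up at all times, used or not
    max_workers: 10
    resources: {}
    node_config: {}
```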

What I'm looking for is something in the Python API that I can call during an interactive session to reserve some workers and keep them alive, similar to what creating a placement group does, except that, unlike with a placement group, those resources would not be blocked from taking on tasks from the global pool.

kevin85421 commented 1 month ago

> If I'm understanding correctly, I'd still have to change `min_replicas` in a config file and redeploy, but that is already possible by setting `min_workers` in my cluster config YAML and restarting the head node.

You don't need to restart the head node in KubeRay; config changes are detected dynamically.

> What I'm looking for is something in the Python API that I can call during an interactive session to reserve some workers and keep them alive;

Maybe this developer API can fulfill your requirements, though it is designed for Ray library developers (e.g., Ray Train, Ray Serve) and advanced users:

https://docs.ray.io/en/latest/cluster/running-applications/autoscaling/reference.html#ray-autoscaler-sdk-request-resources
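For example, an interactive session might use it roughly as sketched below (this assumes the overwrite semantics described in the linked docs; the bundle shapes are illustrative):

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to scale up so that 4 one-CPU bundles could be
# placed. Unlike a placement group, this reserves nothing: tasks from
# the global pool can still run on the extra nodes.
request_resources(bundles=[{"CPU": 1}] * 4)

# ... interactive work ...

# Calls overwrite previous requests, so an empty request at the end of
# the session should release the floor (assumption based on the docs).
request_resources(num_cpus=0)
```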

akdienes commented 1 month ago

Looking at `ray.autoscaler.sdk.request_resources`, the following questions come to mind: