Open akdienes opened 3 months ago
This would be extremely useful. I would love to know if this is possible.
cc @kevin85421
Do you use KubeRay? In KubeRay, you can directly edit the min / max replicas of worker groups.
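For reference, this is roughly where those knobs live in a RayCluster manifest. A minimal sketch, assuming placeholder names, image, and replica counts (none of these values come from this issue, and required fields like the head group are omitted):

```yaml
# Sketch of a KubeRay RayCluster worker group (illustrative values only;
# headGroupSpec, autoscaler options, and other required fields omitted).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster            # placeholder name
spec:
  workerGroupSpecs:
    - groupName: default-worker    # placeholder group name
      minReplicas: 2               # workers kept alive even when idle
      maxReplicas: 10              # upper bound for scale-up
      rayStartParams: {}
      template:                    # abbreviated pod template
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
```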
I do not use KubeRay, just the vanilla autoscaler.
I would recommend using KubeRay instead if you are able to launch a K8s cluster.
That sounds like quite a complicated change to make to my existing (working) setup, and I'm not sure how it solves the problem. If I'm understanding correctly, I'd still have to make changes to `min_replicas` in a config file and redeploy --- but that's already possible by setting `min_workers` in my cluster config YAML and restarting the head node.
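For concreteness, the `min_workers` setting referred to here lives in the vanilla autoscaler's cluster config. A minimal sketch, with a placeholder node type name and counts (head node type and provider-specific settings omitted):

```yaml
# Sketch of a vanilla Ray cluster config (illustrative values only).
# min_workers pins that many workers up regardless of load.
cluster_name: my-cluster
max_workers: 10
available_node_types:
  ray.worker.default:
    min_workers: 4            # workers kept alive even when idle
    max_workers: 10
    resources: {"CPU": 4}
    node_config: {}           # provider-specific fields omitted
```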
What I'm looking for is something in the Python API that I can call during an interactive session to reserve some workers and keep them alive; kind of like what creating a placement group does, except that, unlike a placement group reservation, those resources would not be blocked from taking on tasks from the global pool.
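A minimal sketch of the placement-group approach being compared against (the bundle sizes and strategy below are assumptions, not from this issue):

```python
import ray
from ray.util.placement_group import placement_group, remove_placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# Reserve 8 x 4-CPU bundles; the autoscaler keeps matching nodes alive,
# but the reserved CPUs are fenced off from tasks that don't target the group.
pg = placement_group([{"CPU": 4}] * 8, strategy="SPREAD")
ray.get(pg.ready())

@ray.remote(num_cpus=4)
def work():
    return "done"

# Every task has to be pointed at the group explicitly.
ref = work.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
print(ray.get(ref))

remove_placement_group(pg)  # release the reservation when finished
```

This illustrates the friction described above: the reservation keeps nodes alive, but the scheduling strategy has to be passed to every task and the reserved resources are unavailable to the rest of the cluster.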
> If I'm understanding correctly, I'd still have to make changes to `min_replicas` in a config file and redeploy --- but that's already possible by setting `min_workers` in my cluster config YAML and restarting the head node.
You don't need to restart the head node in KubeRay. The config change can be detected dynamically.
> What I'm looking for is something in the Python API that I can call during an interactive session to reserve some workers and keep them alive;
Maybe this developer API can fulfill your requirements, but note that it is designed for Ray library developers (e.g., Ray Train, Ray Serve) and advanced users.
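A rough sketch of how that developer API could be used from an interactive session; the bundle shapes are assumptions, and clearing the request by re-calling with zero/empty values is my reading of the API, so it should be checked against the docs:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to hold capacity for ten 4-CPU bundles.
# This is a standing scale hint; it does not fence the CPUs off from
# other tasks the way a placement group reservation does.
request_resources(bundles=[{"CPU": 4}] * 10)

# ... run interactive work against the warm cluster ...

# Clear the standing request so the cluster can scale back down.
request_resources(num_cpus=0, bundles=[])
```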
Looking at `ray.autoscaler.sdk.request_resources`, the following questions come to mind:
Description
I would like a way to ask my cluster to keep a bunch of nodes alive without having to pass a placement group to all my jobs; ideally, I could interactively spin up a bunch of nodes and let them sit until I close the connection. If I reserve a big placement group, then I have to pass that placement group around, otherwise all those resources will be considered unavailable. And if I set `min_nodes` in the config, then they will always be alive whether or not I am actually using the cluster at that moment.

Use case

No response