Open robertnishihara opened 1 year ago
cc @ericl who wrote the reproduction script
Looking into this, it seems that `ray.util.multiprocessing` launches a fixed-size pool of actors based on the cluster size at start:

```python
ray_cpus = int(ray._private.state.cluster_resources()["CPU"])
if processes is None:
    processes = ray_cpus
```
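To illustrate the failure mode with a plain-Python simulation (no Ray required; the dict below stands in for what `cluster_resources()` returns): on a head-only cluster started with `num-cpus=0`, there may be no `"CPU"` entry at all, so the lookup above raises `KeyError` before the autoscaler has a chance to add nodes.

```python
# Simulated return value of ray._private.state.cluster_resources()
# on a head-only cluster started with num-cpus=0: the "CPU" key is
# absent until the autoscaler brings up worker nodes.
resources = {}

try:
    ray_cpus = int(resources["CPU"])
except KeyError:
    print("KeyError: pool sizing fails before any worker nodes join")
```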
It should be possible to enhance the multiprocessing module to support autoscaling, similar to the Datasets actor pool. cc @edoakes
I'll tag this as a P1 for core for now.
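The autoscaling idea can be sketched in pure Python (no Ray; `AutoscalingPool` and its methods are illustrative names, not a Ray API): instead of fixing the worker count at startup, grow it on demand up to a cap, the way an actor pool scales between a min and max size.

```python
# Hedged sketch of the autoscaling-pool idea: worker count grows with
# demand instead of being fixed from cluster_resources() at startup.
class AutoscalingPool:
    def __init__(self, min_size=1, max_size=8):
        self.min_size = min_size
        self.max_size = max_size
        self.workers = min_size

    def submit(self, pending_tasks):
        # Scale up when pending tasks outnumber workers, capped at max_size.
        if pending_tasks > self.workers:
            self.workers = min(pending_tasks, self.max_size)
        return self.workers

pool = AutoscalingPool(min_size=1, max_size=4)
print(pool.submit(2))   # grows to 2 workers
print(pool.submit(10))  # capped at 4 workers
```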
Customer LiveEO ran into this limitation. @jjyao
Hi - we're seeing the same issue when launching via joblib on a head-pod-only Ray cluster (which should start scaling) and are wondering if it has been solved?
I've currently patched it with the following:

```python
ray_cpus = int(ray._private.state.cluster_resources().get("CPU") or processes)
```
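The effect of that patch can be sketched as a standalone helper (pure Python; `pool_size` is an illustrative name, not a Ray API): when the cluster reports no CPUs yet, fall back to the caller-requested `processes` count.

```python
def pool_size(cluster_resources, processes):
    # Patched lookup: fall back to the requested `processes` when the
    # cluster (e.g. a head-only pod with num-cpus=0) reports no CPUs yet.
    return int(cluster_resources.get("CPU") or processes)

# Head-only cluster: no "CPU" key reported, so use the job count.
print(pool_size({}, 200))             # 200
# Once the cluster has scaled and reports CPUs, they take precedence.
print(pool_size({"CPU": 16.0}, 200))  # 16
```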
Hi @rupertcw, thanks for reporting. Looks like you already have a workaround? If you could provide more information regarding your use-case and its need for this feature, it will help us prioritize. cc: @jjyao @anyscalesam
hi @ruisearch42 - thanks for following up.
The scenario is the following:
- The head pod is started with `num-cpus = 0` (to avoid running jobs on the head pod).
- Each job sets `num-cpus = 1` in its remote args, and we expect the head pod to automatically scale the cluster to run all these jobs.
- There are 200 jobs at `num-cpus = 1` (not counting the extra init and end_batch).

When using this setup, `ray._private.state.cluster_resources()["CPU"]` doesn't have the information to get a CPU count, so it fails.
All I did was use `processes` instead (which would be 200 in my example) so that it scales by that number - seems to work for now.
What happened + What you expected to happen
I'm running the reproduction script included below.
The cluster adds a node (presumably due to autoscaling), but the second node never gets utilized, presumably because the actors are placed on the first node and can't easily be moved around. [I didn't verify that this is the case, just speculating.]
Versions / Dependencies
Ray 2.2, Python 3.9
Reproduction script
Issue Severity
None