stevenhubhub closed this issue 1 year ago
Yeah, try this instead:
cluster_resources().get("CPU", 0)
Thanks!
Good morning, I have the same issue when trying to use Ray on a Databricks cluster (with autoscaling). The runtime is 15.1 ML. I imported the following libraries:
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster
from ray.util.multiprocessing import Pool
import ray
This is how I set up the environment:
setup_ray_cluster(
    num_worker_nodes=4,
    num_cpus_per_node=4,
    autoscale=True,
)
ray.init(ignore_reinit_error=True)
Then I decorated a function with @ray.remote so that it would run in parallel,
and tried to create a pool of processes this way:
with Pool(processes=8) as pool:
    pool.starmap(foo, inputs)
How can I solve this problem? Thank you for your help.
What happened + What you expected to happen
What I did: I tried to get the total number of CPUs provided by the cluster.
What went wrong: The error output was as follows:
{cluster_resources()['CPU']} CPU resources in total
KeyError: 'CPU'
Update: I seem to have found the cause. When there is no available CPU in the cluster, the 'CPU' key is absent from the returned dict, which leads to the error.
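As a rough illustration of the behavior described above, the sketch below contrasts a direct key lookup with a defaulted lookup on a plain dict. The helper name total_cpus and the sample dicts are hypothetical stand-ins and not part of Ray; with a running cluster, the assumption is that the dict returned by ray.cluster_resources() would be passed in instead.

```python
from typing import Mapping


def total_cpus(resources: Mapping[str, float]) -> float:
    """Return the CPU count from a resource mapping, or 0.0 if the key is absent.

    `resources` is assumed to look like the dict returned by
    ray.cluster_resources(); this helper is hypothetical, not a Ray API.
    """
    # .get() with a default avoids the KeyError raised by resources["CPU"]
    # when the mapping has no "CPU" entry.
    return resources.get("CPU", 0.0)


# Made-up sample mappings, used only for illustration.
with_cpu = {"CPU": 8.0, "memory": 1.0e9}
without_cpu = {"memory": 1.0e9}

print(total_cpus(with_cpu))
print(total_cpus(without_cpu))
```

Returning 0.0 when the key is absent lets the caller decide whether to wait for CPUs to appear or to fail with a clearer message than a bare KeyError.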
Versions / Dependencies
Ray: 2.3.1
OS: Debian 11
Python: 3.9.2
Reproduction script
from ray import init, cluster_resources
init()
print(f"{cluster_resources()['CPU']}")
Issue Severity
High: It blocks me from completing my task.