ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.29k stars 5.83k forks

[<Ray component: Cluster>] KeyError: 'CPU' error in Linux #37677

Closed stevenhubhub closed 1 year ago

stevenhubhub commented 1 year ago

What happened + What you expected to happen

What I did: I tried to get the total number of CPUs provided by the cluster.

What went wrong: Evaluating {cluster_resources()['CPU']} in the message "CPU resources in total" raised KeyError: 'CPU'.

Update: I seem to have found the cause. When there is no available CPU in the cluster, the 'CPU' key is no longer in the returned dict, and the lookup fails.

Versions / Dependencies

ray: 2.3.1
os: Debian 11
python: 3.9.2

Reproduction script

from ray import init, cluster_resources
init()
print(f"{cluster_resources()['CPU']}")

Issue Severity

High: It blocks me from completing my task.

jjyao commented 1 year ago

Yea, try to do cluster_resources().get("CPU", 0)

stevenhubhub commented 1 year ago

Yea, try to do cluster_resources().get("CPU", 0)

Thanks!
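For readers landing here, the suggested fix can be sketched as a small helper. This is only an illustration: `total_cpus` is a hypothetical name, and plain dicts stand in for the value returned by `ray.cluster_resources()` so the snippet runs without a cluster.

```python
from typing import Mapping


def total_cpus(resources: Mapping[str, float]) -> float:
    """Return the CPU count from a Ray-style resource dict, or 0 if the key is absent."""
    # dict.get avoids the KeyError raised by resources["CPU"] when no CPU is registered.
    return resources.get("CPU", 0)


# Plain dicts used as stand-ins for ray.cluster_resources() (an assumption for this sketch).
with_cpus = {"CPU": 8.0, "memory": 1024.0}
without_cpus = {"memory": 1024.0}

print(total_cpus(with_cpus))
print(total_cpus(without_cpus))
```

Against a live cluster the call would be `total_cpus(ray.cluster_resources())`.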

davide-russo-tfs commented 2 months ago

Good morning, I have the same issue while trying to use Ray on a Databricks cluster (with autoscaling). The runtime used is 15.1 ML. I imported the following libraries:

from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster
from ray.util.multiprocessing import Pool
import ray

This is how I set up the environment:

setup_ray_cluster(
    num_worker_nodes=4,
    num_cpus_per_node=4,
    autoscale=True,
)
ray.init(ignore_reinit_error=True)

Then I decorated a function with @ray.remote so that it could run in parallel, and tried to create a pool of processes this way:

with Pool(processes=8) as pool:
    pool.starmap(foo, inputs)

How can I solve this problem? Thank you for your help.
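One possible workaround, offered as an untested sketch: if the KeyError here has the same cause as in the original report (no 'CPU' entry yet, perhaps because the autoscaling cluster has not started any workers), polling until a CPU appears before creating the Pool might help. `wait_for_cpus` and its parameters are hypothetical names, not part of Ray, and the fake resource function below only simulates a cluster that gains CPUs over time.

```python
import time
from typing import Callable, Mapping


def wait_for_cpus(
    get_resources: Callable[[], Mapping[str, float]],
    min_cpus: float = 1,
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> float:
    """Poll a resource dict until it reports at least min_cpus CPUs, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        # A missing "CPU" key is treated as zero CPUs instead of raising KeyError.
        cpus = get_resources().get("CPU", 0)
        if cpus >= min_cpus:
            return cpus
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no CPU resources reported within {timeout_s} seconds")
        time.sleep(poll_s)


# Simulated cluster (an assumption for this sketch): CPUs appear on the third poll.
_snapshots = iter([{}, {"memory": 1.0}, {"CPU": 4.0}])


def fake_resources() -> Mapping[str, float]:
    return next(_snapshots)


print(wait_for_cpus(fake_resources, poll_s=0.01))
```

With Ray, the call would be `wait_for_cpus(ray.cluster_resources)` placed between `ray.init(...)` and the `with Pool(...)` block. Whether this resolves the Databricks case has not been confirmed in this thread.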