ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.54k stars 5.51k forks

Integrate with autoscaler to improve error message: The actor or task with ID [] is pending and cannot currently be scheduled #8326

Closed J1810Z closed 3 years ago

J1810Z commented 4 years ago

After setting up a new conda environment with Ray, I am running into the issue that Ray complains about insufficient resources. Sometimes the actors still start after about 30 s, but most of the time my Python program gets stuck at this point.

I am initializing Ray from within my Python script, which runs on a node scheduled by Slurm. Access to CPUs and GPUs is limited via cgroups. psutil.Process().cpu_affinity() reports the correct number of available cores, which is higher than the resources Ray needs. Interestingly, I didn't run into this issue in my previous conda environment.
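For anyone trying to reproduce this setup, here is a minimal sketch of how the CPU count allowed by the affinity mask could be compared with what the OS reports, and then passed to Ray explicitly. The helper names `available_cpus` and `ray_init_kwargs` are made up for illustration, and the sketch uses the standard-library `os.sched_getaffinity` (Linux only), which on Linux should report the same set as `psutil.Process().cpu_affinity()`. Whether pinning `num_cpus` actually avoids the hang described here is an assumption, not something confirmed in this thread.

```python
import os


def available_cpus() -> int:
    """Number of CPUs in this process's affinity mask (hypothetical helper)."""
    try:
        # Linux only: the set of CPU ids this process is allowed to run on.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the OS-wide count.
        return os.cpu_count() or 1


def ray_init_kwargs(num_gpus=None) -> dict:
    """Build keyword arguments for ray.init() from the affinity mask (hypothetical helper)."""
    kwargs = {"num_cpus": available_cpus()}
    if num_gpus is not None:
        kwargs["num_gpus"] = num_gpus
    return kwargs


if __name__ == "__main__":
    print("CPUs reported by the OS:  ", os.cpu_count())
    print("CPUs this process may use:", available_cpus())
    # With Ray installed, the values could be passed explicitly instead of
    # relying on auto-detection:
    #   import ray
    #   ray.init(**ray_init_kwargs(num_gpus=1))
```

If the two printed numbers differ, the job is restricted to a subset of the node's cores, and passing `num_cpus` explicitly is one way to make sure Ray does not assume more than that.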

The error message does not help much:

2020-05-05 17:17:32,657 WARNING worker.py:1072 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {object_store_memory: 0.048828 GiB} for execution and {CPU: 1.000000}, {object_store_memory: 0.048828 GiB} for placement, but this node only has remaining {node:192.168.7.50: 1.000000}, {CPU: 28.000000}, {memory: 30.029297 GiB}, {GPU: 1.000000}, {object_store_memory: 10.351562 GiB}. In total there are 0 pending tasks and 3 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
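To make the numbers in that warning easier to compare, here is a rough sketch that parses the `{name: value}` pairs and lists any resource whose remaining amount is below what placement requires. The regular expression and the helper names `parse_resources` and `shortfalls` are illustrative assumptions based only on the message text quoted above, not on Ray's internal format.

```python
import re

# Resource portion of the warning quoted above, copied verbatim.
WARNING = (
    "It requires {object_store_memory: 0.048828 GiB} for execution and "
    "{CPU: 1.000000}, {object_store_memory: 0.048828 GiB} for placement, "
    "but this node only has remaining {node:192.168.7.50: 1.000000}, "
    "{CPU: 28.000000}, {memory: 30.029297 GiB}, {GPU: 1.000000}, "
    "{object_store_memory: 10.351562 GiB}."
)

# Matches "{name: number}" or "{name: number GiB}"; the name may contain colons.
PAIR = re.compile(r"\{(?P<name>[^{}]+): (?P<value>[0-9.]+)(?: GiB)?\}")


def parse_resources(text: str) -> dict:
    """Map each resource name found in text to its numeric value."""
    return {m["name"]: float(m["value"]) for m in PAIR.finditer(text)}


def shortfalls(message: str) -> dict:
    """Return {name: (required, remaining)} for resources that fall short."""
    before, remaining_text = message.split("but this node only has remaining")
    placement_text = before.split("for execution and")[1]
    required = parse_resources(placement_text)
    remaining = parse_resources(remaining_text)
    return {
        name: (need, remaining.get(name, 0.0))
        for name, need in required.items()
        if remaining.get(name, 0.0) < need
    }


if __name__ == "__main__":
    print(shortfalls(WARNING))
```

For the message above the result is an empty dict: going by the reported numbers, the node has more CPU and object store memory remaining than placement asks for, which is why the warning is hard to act on.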