ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Potential Bug] Spilling of tasks on single node cluster #21173

Closed czgdp1807 closed 2 years ago

czgdp1807 commented 2 years ago

Search before asking

Ray Component

Ray Clusters

What happened + What you expected to happen

AFAICT, this issue is about spilling of tasks on a single-node cluster. What happens is that, if there is only one node in the cluster, then due to lack of resources an attempt is made to spill the remaining tasks to other nodes. These attempts will fail because there is only one node.

Therefore, instead of doing the above, a warning should be shown to end users that the autoscaler is not running and some tasks are being delayed or not scheduled due to lack of resources.
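
For reference, a user can already detect this situation manually by comparing the cluster's declared CPUs against what the actors will request. A minimal sketch (not part of the proposed fix; the values mirror the reproduction script below, the public ray.cluster_resources() API is used, and the warning text is only illustrative):

import ray

# Assumed single-node setup with no autoscaler, mirroring the reproduction script below.
num_cpus_actor = 10    # CPUs each actor requests
num_actors = 50        # number of actors to create
ray.init(num_cpus=50)  # the single node advertises 50 CPUs

total_cpus = ray.cluster_resources().get("CPU", 0)
requested_cpus = num_actors * num_cpus_actor
if requested_cpus > total_cpus:
    # Without an autoscaler, the missing CPUs will never appear,
    # so some actors will stay pending indefinitely.
    print(f"Warning: actors request {requested_cpus} CPUs in total, "
          f"but the cluster only has {total_cpus}.")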

Following are some quotes from #20891:

https://github.com/ray-project/ray/issues/20891#issuecomment-996923485

This looks like a bug, if there is only a single node in the cluster, tasks shouldn't be spilled, right? Who on the core team should help with this/knows this code the best? If actors cannot be scheduled due to lack of CPUs (and the actor requires a CPU), I would expect a message to the user stating so, unless the cluster has an autoscaler running.

https://github.com/ray-project/ray/issues/20891#issuecomment-997049442

can you open a separate one for the investigation about the warning for users if actors are waiting to be scheduled for an extended period of time due to lack of resources?

Reference comment - https://github.com/ray-project/ray/issues/20891#issuecomment-996679034

cc: @pcmoritz @scv119 @stephanie-wang @rkooo567

Versions / Dependencies

Version - Master branch (Commit - f5dfe6c1580443b3ce0a149b4e7f9d108cf43b6d)

Python - 3.8.11

OS - Windows 10 Pro (8 Virtual Processors and 16 GB RAM).

Reproduction script

import time

import ray

num_cpus_actor = ...  # set some value here
num_cpus_ray = ...    # set some value here also
num_actors = ...      # set some value here

@ray.remote(num_cpus=num_cpus_actor)  # tried different values
class Actor:
    def __init__(self):
        self.num = 0

if __name__ == '__main__':
    num_cpus = num_cpus_ray
    try:
        ray.init(num_cpus=num_cpus)
        time.sleep(1)
        for i in range(num_actors):
            print(f"Starting actor {i}")
            Actor.options(name=str(i + 1), lifetime="detached").remote()
            time.sleep(1)
        time.sleep(10)
    except Exception as e:
        print("Exception {} ".format(str(e)))
    finally:
        ray.shutdown()

Satisfy the following conditions:

num_cpus_ray < num_actors * num_cpus_actor

Only one node should be used (this is Ray's default setting anyway).
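
For example, the values used in the follow-up script later in this thread satisfy the condition (a quick illustration):

num_cpus_actor = 10   # each actor requests 10 CPUs
num_cpus_ray = 50     # the single node only advertises 50 CPUs
num_actors = 50
assert num_cpus_ray < num_actors * num_cpus_actor  # 50 < 500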

Anything else

No response

Are you willing to submit a PR?

rkooo567 commented 2 years ago

Let me take a look to see if this makes sense this Sunday.

pcmoritz commented 2 years ago

It sounds like the warning about pending actors that cannot be scheduled is still printed in this case. @czgdp1807, can you post where it is printed and how it looks?

I'm not sure whether spilling should be invoked on a single-node cluster, but if the warning is printed and there are no other negative side effects, maybe it is OK and we can close this issue?

czgdp1807 commented 2 years ago

(scheduler +26s) Warning: The following resource request cannot be scheduled right now: {'CPU': 10.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
Starting actor 10
2021-12-21 19:11:52,494 WARNING worker.py:1257 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="ec3e2076-7749-4dbd-bba6-cc9859ff32a3", ...)

@pcmoritz I get the above warning with the following code,

import time

import ray, sys

print(ray.__version__)

num_cpus_actor = 10
num_cpus_ray = 50

@ray.remote(num_cpus=num_cpus_actor)  # tried different values
class Actor:
    def __init__(self):
        self.num = 0

if __name__ == '__main__':
    cluster_mode = False
    num_cpus = num_cpus_ray
    try:
        ray.init(num_cpus=num_cpus)
        time.sleep(1)
        for i in range(50):
            print(f"Starting actor {i}")
            Actor.options(name=str(i + 1), lifetime="detached").remote()
            time.sleep(1)
        time.sleep(10)
    except Exception as e:
        print("Exception {} ".format(str(e)))
    finally:
        ray.shutdown()

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!