ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Core] The actors got distributed to just a few nodes even with spread scheduling #27577

Open jianoaix opened 2 years ago

jianoaix commented 2 years ago

Run this on master, in a cluster with 20 nodes:

import ray

@ray.remote
class TestConsumingActor:
    def __init__(self, rank):
        self._rank = rank

    def consume(self, split):
        pass

num_workers = 20
splits = list(range(num_workers))

consumers = [
    TestConsumingActor.options(scheduling_strategy="SPREAD").remote(i)
    for i in range(num_workers)
]

futures = [consumers[i].consume.remote(s) for i, s in enumerate(splits)]
ray.get(futures)

You can see that the actors ended up concentrated on a few nodes (3 nodes with 5 actors each in this case -- sometimes it's even more concentrated than this):

[Screenshot from 2022-08-05: dashboard view of the actor distribution across nodes]
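
For reference, here is a small sketch (mine, not from the original report) that counts actors per node programmatically instead of reading the dashboard. It assumes a recent Ray where ray.get_runtime_context().get_node_id() is available:

import ray
from collections import Counter

ray.init(address="auto")  # connect to the existing cluster

@ray.remote
class TestConsumingActor:
    def __init__(self, rank):
        self._rank = rank

    def get_node_id(self):
        # Report which node this actor landed on.
        return ray.get_runtime_context().get_node_id()

consumers = [
    TestConsumingActor.options(scheduling_strategy="SPREAD").remote(i)
    for i in range(20)
]
node_ids = ray.get([c.get_node_id.remote() for c in consumers])
# With working SPREAD on 20 nodes this should be roughly 1 actor per
# node; the bug shows up as a few node ids with large counts.
print(Counter(node_ids))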

Note: if we add a CPU requirement to the actor, the actors spread as expected:

@ray.remote(num_cpus=1)
class TestConsumingActor:
    def __init__(self, rank):
        self._rank = rank

    def consume(self, split):
        pass
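
A possible alternative workaround (my suggestion, not from the thread) is to reserve a 1-CPU bundle per worker with a SPREAD placement group and pin each actor to its own bundle, so placement is spread by construction. A sketch reusing the TestConsumingActor definition above:

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

num_workers = 20
# One 1-CPU bundle per worker, spread across nodes.
pg = placement_group([{"CPU": 1}] * num_workers, strategy="SPREAD")
ray.get(pg.ready())  # block until all bundles are reserved

consumers = [
    TestConsumingActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote(i)
    for i in range(num_workers)
]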

@scv119

wuisawesome commented 2 years ago

I was looking through a lot of this code last night, and I think there's a bug in our resource accounting logic; this is actually the same underlying issue as https://github.com/ray-project/ray/issues/26751.

In particular, I think this code needs to use PlacementResources when we're scheduling an actor: https://github.com/ray-project/ray/blob/master/src/ray/raylet/scheduling/cluster_task_manager.cc#L327
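
This point can also be seen from the Python side: by default an actor needs 1 CPU to be created (its placement resources) but 0 CPUs while running (its required resources), so once the actors are up they consume no logical resources at all. A small sketch (mine, assuming a running cluster; the Idle class is hypothetical) showing that available CPUs are unchanged after creating such actors -- which is why a scheduler that only accounts for required resources sees every node as equally idle:

import ray

ray.init(address="auto")

@ray.remote
class Idle:  # hypothetical actor, default resources (0 CPUs while running)
    def ping(self):
        return "pong"

before = ray.available_resources().get("CPU", 0)
actors = [Idle.remote() for _ in range(8)]
ray.get([a.ping.remote() for a in actors])  # wait until all actors are up
after = ray.available_resources().get("CPU", 0)

# before == after: running actors with the default of 0 CPUs hold no
# logical resources, so resource accounting alone cannot spread them.
print(before, after)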