jianoaix opened this issue 2 years ago
I was looking through a lot of this code last night, and I think there's a bug in our resource accounting logic. It's actually the same underlying issue as https://github.com/ray-project/ray/issues/26751.
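In particular, I think this should use PlacementResources when we're scheduling an actor: https://github.com/ray-project/ray/blob/master/src/ray/raylet/scheduling/cluster_task_manager.cc#L327
By default, a Ray actor requires 1 CPU for scheduling but 0 CPUs while running, so accounting only for its running requirement makes the node look as if nothing was placed on it.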
Run this on master in a cluster with 20 nodes:
You can see that the actors end up concentrated on a few nodes (in this case, 3 nodes with 5 actors each; sometimes the placement is even more concentrated than this):
Note: if we add a CPU requirement to the actor, the actors are spread out as expected.
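To illustrate why the accounting matters, here is a toy model. It is not Ray's actual scheduling policy, just a hypothetical least-utilized-node scheduler that charges each placed actor some amount of CPU. Charging 0 CPU (what a default actor requires while running) leaves every node looking idle, so the tie-break sends every actor to the same node; charging 1 CPU (what it requires for placement) spreads them out. The names `Node`, `pick_node`, and `place_actors`, and the cluster sizes used below, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical node: CPU capacity plus a tally of CPU the scheduler has accounted."""
    node_id: int
    total_cpu: float
    used_cpu: float = 0.0
    actors: int = 0

    def utilization(self) -> float:
        return self.used_cpu / self.total_cpu


def pick_node(nodes: List[Node], demand_cpu: float) -> Node:
    # Toy policy (NOT Ray's real scheduler): among nodes with enough free CPU,
    # choose the least-utilized one, breaking ties by the lowest node_id.
    feasible = [n for n in nodes if n.total_cpu - n.used_cpu >= demand_cpu]
    return min(feasible, key=lambda n: (n.utilization(), n.node_id))


def place_actors(num_nodes: int, num_actors: int, cpus_per_node: float,
                 accounted_cpu: float) -> List[int]:
    """Place actors one at a time, charging `accounted_cpu` to the chosen node.

    Returns how many actors landed on each node.
    """
    nodes = [Node(i, cpus_per_node) for i in range(num_nodes)]
    for _ in range(num_actors):
        node = pick_node(nodes, accounted_cpu)
        # accounted_cpu=0.0 models charging only the actor's running requirement.
        node.used_cpu += accounted_cpu
        node.actors += 1
    return [n.actors for n in nodes]


if __name__ == "__main__":
    # Hypothetical cluster: 20 nodes with 4 CPUs each, 15 actors.
    print("accounted 0 CPU:", place_actors(20, 15, 4, 0.0))
    print("accounted 1 CPU:", place_actors(20, 15, 4, 1.0))
```

With 0 CPU accounted, all 15 actors land on node 0; with 1 CPU accounted, each of the first 15 nodes gets one actor. The toy's deterministic tie-break exaggerates the effect compared with the few-node concentration reported above, and setting `accounted_cpu=1.0` roughly corresponds to the workaround of giving the actor an explicit CPU requirement.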
@scv119