Open Catch-Bull opened 1 year ago
Hmm what's the proposal here? You are saying we should do FIFO chunk transfer instead of round robin?
This can have issues like many tasks that require small objs cannot be scheduled because of a task that requires large objects?
@rkooo567 Sorry, I missed your comment, so I am replying late. Our ultimate goal is to resolve this issue. The details of the current issue are on the PR, and we can discuss them there.
Description
Regarding the round-robin algorithm of the push manager in our scenario:
I think there are too many invalid (wasted) chunk transfers. The scheduling of normal tasks is prone to conflicts, which leaves a large number of tasks in a node's waiting task queue. When these tasks pull objects at the same time, their argument preparation finishes at about the same time, and only a few of them can be dispatched to workers smoothly. The other tasks are spilled out, so all the pull requests made for them are wasted.
Here is a simple test:
Use case
Round-robin across object manager clients, and FIFO across the objects of each client.
prototype: https://github.com/ray-project/ray/pull/34269
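
To make the proposed ordering concrete, here is a minimal Python sketch of "round-robin across clients, FIFO across objects". It is not the code from the prototype PR, and it is not Ray's push manager (which is written in C++). The names `PushScheduler`, `start_push`, `next_chunk`, and `drain` are hypothetical, and the sketch assumes every object has at least one chunk and that one chunk is sent per scheduling step.

```python
from collections import OrderedDict, deque


class PushScheduler:
    """Hypothetical sketch: round-robin over destination nodes,
    FIFO over the objects queued for each node."""

    def __init__(self):
        # node_id -> deque of [object_id, next_chunk_index, num_chunks]
        self._queues = OrderedDict()

    def start_push(self, node_id, object_id, num_chunks):
        if num_chunks < 1:
            raise ValueError("num_chunks must be at least 1")
        self._queues.setdefault(node_id, deque()).append(
            [object_id, 0, num_chunks]
        )

    def next_chunk(self):
        """Return (node_id, object_id, chunk_index), or None when idle."""
        if not self._queues:
            return None
        # Round-robin: always serve the node at the front of the ordering.
        node_id, queue = next(iter(self._queues.items()))
        # FIFO: only the oldest object of that node makes progress.
        entry = queue[0]
        object_id, chunk_index, num_chunks = entry
        entry[1] += 1
        if entry[1] == num_chunks:
            queue.popleft()  # object fully sent
        if queue:
            self._queues.move_to_end(node_id)  # rotate node to the back
        else:
            del self._queues[node_id]  # nothing left for this node
        return node_id, object_id, chunk_index


def drain(scheduler):
    """Collect the full chunk order until the scheduler is idle."""
    order = []
    while (chunk := scheduler.next_chunk()) is not None:
        order.append(chunk)
    return order


if __name__ == "__main__":
    s = PushScheduler()
    s.start_push("node-A", "obj-x", 2)
    s.start_push("node-A", "obj-y", 2)
    s.start_push("node-B", "obj-z", 2)
    for chunk in drain(s):
        print(chunk)
```

In this sketch, bandwidth is still shared fairly between nodes, but within one node an object finishes completely before the next one starts, so the arguments of different tasks no longer all become ready at the same moment. The concern raised above still applies: a small object queued behind a large one for the same node has to wait for it.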