stephanie-wang opened this issue 4 months ago (Open)
p2 for now (until we need more perf features)
I'm interested in this task and would love to work on it.
```python
# sender, receiver1, and receiver2 are GPU actors defined elsewhere.
from ray.dag import InputNode, MultiOutputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

with InputNode() as inp:
    dag = sender.send.bind(shape, dtype, inp)
    dag = dag.with_type_hint(TorchTensorType(shape, dtype))
    o1 = receiver1.recv.bind(dag)
    o2 = receiver2.recv.bind(dag)
    dag = MultiOutputNode([o1, o2])
```
This should call ncclBroadcast under the hood, since multiple receivers are getting a tensor sent by a single sender. @stephanie-wang, is my understanding accurate?
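For reference, here is a minimal sketch of the two transfer patterns in question, using torch.distributed as a stand-in for the raw NCCL calls (this is not Ray's internal code; the function names, rank numbering, and group setup are assumptions):

```python
# Minimal sketch of the two NCCL transfer patterns, assuming a process group
# has already been initialized with dist.init_process_group("nccl") and that
# rank 0 is the sender. torch.distributed stands in for raw NCCL here.
import torch
import torch.distributed as dist

def send_to_each_receiver(tensor: torch.Tensor, receiver_ranks: list[int]) -> None:
    # Current behavior: one point-to-point ncclSend per receiver.
    for rank in receiver_ranks:
        dist.send(tensor, dst=rank)

def broadcast_to_all(tensor: torch.Tensor) -> None:
    # Proposed behavior: a single ncclBroadcast. Every rank in the group
    # (the sender and all receivers) must enter this call collectively.
    dist.broadcast(tensor, src=0)
```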
Actually, I'd like to hold off on this one until we have a clearer design. The reason is that broadcast does not always work well when the downstream tasks are not all running at the same time. For example:
```python
# Same setup as above, with one extra task on receiver1.
with InputNode() as inp:
    dag = sender.send.bind(shape, dtype, inp)
    dag = dag.with_type_hint(TorchTensorType(shape, dtype))
    # long_running keeps receiver1 busy, so receiver1.recv cannot start
    # until it finishes.
    dag2 = receiver1.long_running.bind(inp)
    o1 = receiver1.recv.bind(dag)
    o2 = receiver2.recv.bind(dag)
    dag = MultiOutputNode([o1, o2, dag2])
```
In this case, receiver1.recv starts only after receiver1.long_running finishes, and because the transfer is a broadcast, receiver2.recv has to wait as well, which differs from the current point-to-point semantics. I think we need a more refined heuristic to support this instead of just using broadcast in all cases. The sketch below makes the hazard concrete.
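A hedged sketch of the stall, again with torch.distributed standing in for NCCL and a sleep standing in for long_running (rank numbering and tensor shape are assumptions):

```python
# Three ranks in an initialized NCCL process group:
# 0 = sender, 1 = receiver1, 2 = receiver2.
import time
import torch
import torch.distributed as dist

def run(rank: int) -> None:
    tensor = torch.ones(1024, device=f"cuda:{rank}")
    if rank == 1:
        time.sleep(60)  # stands in for receiver1.long_running
    # broadcast is a collective: the transfer cannot complete until every
    # rank has entered it, so rank 2's data is delayed by the full 60s even
    # though rank 2 was ready immediately. With point-to-point sends,
    # rank 2 would have received its copy right away.
    dist.broadcast(tensor, src=0)
```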
Got it, that makes a lot of sense. Do we have an estimated timeline for completing the design? Please let me know if there's anything I can help with!
I think the multi-output ref follow-ups (multiple ray.get calls and skipping deserialization) are good candidates!
For this issue, if you'd like to take it, I think we need two follow-ups:
Sounds good, I'll get started on the deserialization problem and get back to this issue after doing a bit more research.
Description
When the same GPU tensor is sent to multiple readers, we should use ncclBroadcast under the hood to reduce transfer time.
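For concreteness, the call in question at the NCCL-binding level looks roughly like the following. This is a hypothetical sketch using cupy's NCCL bindings (which Ray's GPU channels build on); the helper function and its wiring are illustrative, not Ray's actual channel code:

```python
# Hypothetical sketch of the underlying collective via cupy's NCCL bindings.
# comm is an existing nccl.NcclCommunicator; every participating rank must
# make the matching broadcast call for the collective to complete.
import cupy as cp
from cupy.cuda import nccl

def broadcast_gpu_buffer(comm, buf: cp.ndarray, root: int, stream_ptr: int) -> None:
    # A single ncclBroadcast replaces one ncclSend/ncclRecv pair per reader.
    comm.broadcast(buf.data.ptr, buf.data.ptr, buf.size,
                   nccl.NCCL_FLOAT32, root, stream_ptr)
```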
Use case
No response