Open pspillai opened 2 months ago
Currently, torch-ccl only supports having one rank send while the other receives at any given time. If you change the code to
if my_rank == 0:
    o1 = dist.isend(A, 1 - my_rank)
    o1.wait()
else:
    o2 = dist.irecv(B, 1 - my_rank)
    o2.wait()
it will work. Did you run your test with CUDA's NCCL? If it works with CUDA, I think this is a design issue in torch-ccl.
Yes, if the send and receive ordering is matched, it will work, but this causes the transmissions to be serialized, wasting half of the available bandwidth. (There should be no reason why the two transfers cannot be done concurrently).
I have not tested on NCCL. However, looking at the sample code for torch.distributed.batch_isend_irecv (https://pytorch.org/docs/stable/distributed.html#torch.distributed.batch_isend_irecv) and the source code at https://pytorch.org/docs/stable/_modules/torch/distributed/distributed_c10d.html#batch_isend_irecv, it appears that batch_isend_irecv simply issues the isend/irecv operations in the order provided, which in this example is the same on each rank. So I expect this to work fine on NCCL.
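For concreteness, batch_isend_irecv is driven through torch.distributed.P2POp entries. Below is a minimal sketch of the symmetric pattern discussed here; the two-rank setup, rendezvous values, tensor shapes, and the gloo backend are illustrative assumptions of mine, not code from this issue:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # minimal single-node rendezvous (illustrative values)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    peer = 1 - rank
    A = torch.full((4,), float(rank))  # data to send
    B = torch.empty(4)                 # buffer to receive into

    # Each rank lists its send first, then its receive -- the same order
    # on both ranks, as in the example from this issue.
    ops = [dist.P2POp(dist.isend, A, peer),
           dist.P2POp(dist.irecv, B, peer)]
    for req in dist.batch_isend_irecv(ops):
        req.wait()

    # each rank should end up with the peer's data
    assert torch.all(B == float(peer))
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

On the gloo backend this completes; per the reports above, the same symmetric ordering hangs under torch-ccl.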
Not surprisingly, batch_isend_irecv locks up with this example using ccl.
I am trying to implement concurrent asynchronous sends and receives between multiple processes, but this results in deadlock. Minimal code to reproduce this is as follows:
Run with
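(The original reproduction snippet and launch command did not survive here. The pattern described, with each rank posting isend and irecv concurrently and then waiting on both, might be sketched as follows; the two-rank setup, rendezvous values, tensor shapes, and the gloo backend are my own illustrative assumptions, not the original code:)

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # minimal single-node rendezvous (illustrative values)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    peer = 1 - rank
    A = torch.full((4,), float(rank))  # data to send
    B = torch.empty(4)                 # buffer to receive into

    # Both ranks post the send first, then the receive -- the symmetric
    # ordering that reportedly deadlocks under torch-ccl.
    send_req = dist.isend(A, peer)
    recv_req = dist.irecv(B, peer)
    send_req.wait()
    recv_req.wait()

    # each rank should have received the peer's data
    assert torch.all(B == float(peer))
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

On the gloo backend this completes with both transfers in flight at once; under torch-ccl this symmetric ordering is what hangs.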
It sounds like the isend and irecv on each process are serialized. This particular example can complete if one process sends first and the other receives first, but I think the operations are still serialized, so the two transfers are not concurrent.
I tried to use batch_isend_irecv to define a list of transfers, but this resulted in the same deadlock.
Without concurrent transfers, it is almost impossible to implement efficient distributed compute-and-shift algorithms such as Cannon's algorithm.
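To illustrate why the compute-and-shift pattern needs this, here is a hedged sketch of a ring shift, where every rank sends its block to the right neighbor while simultaneously receiving from the left one. The function name, two-rank setup, and gloo backend are my own illustrative choices, not code from this issue:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def ring_shift(x, rank, world_size):
    """Send x to the right neighbor while receiving from the left one."""
    right = (rank + 1) % world_size
    left = (rank - 1) % world_size
    recv = torch.empty_like(x)
    # Both directions are posted together; serializing them would
    # halve the usable bandwidth of the shift step.
    ops = [dist.P2POp(dist.isend, x, right),
           dist.P2POp(dist.irecv, recv, left)]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv

def worker(rank, world_size):
    # minimal single-node rendezvous (illustrative values)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29502"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    x = torch.full((4,), float(rank))
    shifted = ring_shift(x, rank, world_size)
    # after one shift, each rank holds its left neighbor's block
    assert torch.all(shifted == float((rank - 1) % world_size))
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

In an actual Cannon's-algorithm step, each rank would run a local matrix multiply between shifts; the point is that the send and receive of each shift must overlap for the step to use full bandwidth.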