XapaJIaMnu closed this issue 3 years ago
@XapaJIaMnu - the "not completed event is destroyed" error is probably caused by the missing ".wait()" call on the event returned by ccl::allgather: https://github.com/XapaJIaMnu/marian-dev/blob/d33cea1d649186242c244f6a11d599be68f3499c/src/training/communicator_oneccl.h#L226
Allgatherv supports using the same buffer as both send and recv buffer, as MPI does, but there is no special keyword like MPI_IN_PLACE. The recv buffer must already contain the current rank's send data at that rank's offset. The same recv buffer should then be passed twice to allgatherv, e.g. like here: https://github.com/oneapi-src/oneCCL/blob/master/examples/sycl/sycl_allgatherv_inplace_test.cpp#L116
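To make the in-place layout concrete, here is a minimal single-process sketch of the buffer arrangement described above. It does not call oneCCL at all: `rank_offset` and `simulate_inplace_allgatherv` are hypothetical helper names made up for illustration, and each "rank" is just one vector in the same process. The only thing it is meant to show is that every rank writes its own contribution into the recv buffer at the offset given by the counts of the lower ranks, before the collective runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Offset (in elements) of `rank`'s block inside the gathered buffer:
// the sum of the counts of all lower ranks.
std::size_t rank_offset(const std::vector<std::size_t>& recv_counts, std::size_t rank) {
    assert(rank < recv_counts.size());
    return std::accumulate(recv_counts.begin(),
                           recv_counts.begin() + static_cast<std::ptrdiff_t>(rank),
                           std::size_t{0});
}

// Hypothetical single-process stand-in for an in-place allgatherv (NOT the oneCCL API).
// buffers[r] plays the role of rank r's recv buffer. On entry, only the slot
// [rank_offset(r), rank_offset(r) + recv_counts[r]) of buffers[r] holds meaningful data.
void simulate_inplace_allgatherv(std::vector<std::vector<float>>& buffers,
                                 const std::vector<std::size_t>& recv_counts) {
    assert(buffers.size() == recv_counts.size());
    const std::size_t total =
        std::accumulate(recv_counts.begin(), recv_counts.end(), std::size_t{0});

    // Collect every rank's contribution from its own slot.
    std::vector<float> gathered(total);
    for (std::size_t r = 0; r < buffers.size(); ++r) {
        assert(buffers[r].size() == total);
        const auto off = static_cast<std::ptrdiff_t>(rank_offset(recv_counts, r));
        std::copy_n(buffers[r].begin() + off, recv_counts[r], gathered.begin() + off);
    }

    // Every rank ends up with the full gathered result.
    for (auto& buf : buffers) {
        buf = gathered;
    }
}
```

For example, with counts {2, 1, 3}, rank 2 would place its three elements at offset 3 of its six-element recv buffer. In a real program the same recv buffer would then be handed to the collective for both the send and the recv argument, as the linked oneCCL example does.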
@mshiryaev thank you very much, this was entirely user error. Adding the wait call removed the error message, and passing `recvbuf` in place of the partially overlapping `sendbuf` fixed the issue. NCCL works correctly with partially overlapping buffers, and the mistake crept in when porting from there.
Hey,
I'm using oneCCL to implement multinode communication in the marian machine translation toolkit. I am having a problem with a call to `ccl::allgatherv`. As far as I can tell from the documentation, there is no restriction on buffer overlapping. However, if I don't use a temporary buffer onto which I copy the send buffer, as shown here: https://github.com/XapaJIaMnu/marian-dev/blob/d33cea1d649186242c244f6a11d599be68f3499c/src/training/communicator_oneccl.h#L226 I get a crash like this:
Otherwise, if I do use the workaround, I get correct behaviour; however, every call to `allgatherv` is accompanied by the following stderr output:
Any suggestions?
Cheers,
Nick