oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

AllgatherV crashes when the buffers overlap #45

Closed. XapaJIaMnu closed this issue 3 years ago

XapaJIaMnu commented 3 years ago

Hey,

I'm using oneCCL to implement multi-node communication in the Marian machine translation toolkit, and I'm having a problem with a call to ccl::allgatherv. As far as I can tell from the documentation, there is no restriction on overlapping buffers. However, the call crashes unless I first copy the send buffer into a temporary buffer, as shown here: https://github.com/XapaJIaMnu/marian-dev/blob/d33cea1d649186242c244f6a11d599be68f3499c/src/training/communicator_oneccl.h#L226

Without the temporary buffer, I get a crash like this:

2021:03:18-18:25:11:(64995) ERROR: |ERROR| host_event.cpp:33  ~host_event_impl not completed event is destroyed
backtrace() returned 11 addresses
./src/3rd_party/oneCCL/src/libccl.so(+0x1ea707) [0x7f0ecfcf6707]
./src/3rd_party/oneCCL/src/libccl.so(+0x1eacaf) [0x7f0ecfcf6caf]
./src/3rd_party/oneCCL/src/libccl.so(+0x1ead96) [0x7f0ecfcf6d96]
./marian(_ZNK6marian18OneCCLCommunicator15allGatherParamsEv+0xb34) [0x5597b1037c24]
./marian(_ZN6marian14SyncGraphGroup6updateESt6vectorISt10shared_ptrINS_4data5BatchEESaIS5_EEm+0x10fc) [0x5597b0fc0cdc]
./marian(_ZN6marian14SyncGraphGroup6updateESt10shared_ptrINS_4data5BatchEE+0x37a) [0x5597b0fc124a]
./marian(_ZN6marian5TrainINS_14SyncGraphGroupEE3runEv+0x8fc) [0x5597b0cc325c]
./marian(_Z11mainTraineriPPc+0xc9) [0x5597b0bf6729]
./marian(main+0x35) [0x5597b0bd4fa5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f0ecc6bab97]
./marian(_start+0x2a) [0x5597b0bf28ca]
2021:03:18-18:25:11:(64996) ERROR: |ERROR| host_event.cpp:33  ~host_event_impl not completed event is destroyed
backtrace() returned 11 addresses
./src/3rd_party/oneCCL/src/libccl.so(+0x1ea707) [0x7fe469b68707]
./src/3rd_party/oneCCL/src/libccl.so(+0x1eacaf) [0x7fe469b68caf]
./src/3rd_party/oneCCL/src/libccl.so(+0x1ead96) [0x7fe469b68d96]
./marian(_ZNK6marian18OneCCLCommunicator15allGatherParamsEv+0xb34) [0x55debf7dac24]
./marian(_ZN6marian14SyncGraphGroup6updateESt6vectorISt10shared_ptrINS_4data5BatchEESaIS5_EEm+0x10fc) [0x55debf763cdc]
./marian(_ZN6marian14SyncGraphGroup6updateESt10shared_ptrINS_4data5BatchEE+0x37a) [0x55debf76424a]
./marian(_ZN6marian5TrainINS_14SyncGraphGroupEE3runEv+0x8fc) [0x55debf46625c]
./marian(_Z11mainTraineriPPc+0xc9) [0x55debf399729]
./marian(main+0x35) [0x55debf377fa5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fe46652cb97]
./marian(_start+0x2a) [0x55debf3958ca]
2021:03:18-18:25:11:(65005) ERROR: |ERROR| worker.cpp:288  ccl_worker_func worker 0 caught internal exception: oneCCL: allgatherv_entry.hpp:start:76: EXCEPTION: ALLGATHERV entry failed. atl_status: FAILURE
backtrace() returned 4 addresses
./src/3rd_party/oneCCL/src/libccl.so(+0x1ea707) [0x7fe469b68707]
./src/3rd_party/oneCCL/src/libccl.so(+0x3cef7) [0x7fe4699baef7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fe4695626db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fe46662ca3f]
[2021-03-18 18:25:11] Error: Unhandled exception of type 'N3ccl2v19exceptionE': oneCCL: allgatherv_entry.hpp:start:76: EXCEPTION: ALLGATHERV entry failed. atl_status: FAILURE
[2021-03-18 18:25:11] Error: Aborted from void unhandledException() in /home/nbogoych/marian-dev-master/src/common/logging.cpp:113

[CALL STACK]
[0x55debf30537f]                                                       + 0x1cf37f
[0x7fe466f44ae6]                                                       + 0x92ae6
[0x7fe466f44b21]                                                       + 0x92b21
[0x7fe4699badd3]                                                       + 0x3cdd3
[0x7fe4695626db]                                                       + 0x76db
[0x7fe46662ca3f]    clone                                              + 0x3f

With the workaround in place, I get correct behaviour, but every call to allgatherv is accompanied by the following stderr output:

2021:03:18-18:22:38:(64194) ERROR: |ERROR| host_event.cpp:33  ~host_event_impl not completed event is destroyed
backtrace() returned 11 addresses
./src/3rd_party/oneCCL/src/libccl.so(+0x1ea707) [0x7f6991977707]
./src/3rd_party/oneCCL/src/libccl.so(+0x1eacaf) [0x7f6991977caf]
./src/3rd_party/oneCCL/src/libccl.so(+0x1ead96) [0x7f6991977d96]
./marian(_ZNK6marian18OneCCLCommunicator15allGatherParamsEv+0xb34) [0x560cdcb0dc24]
./marian(_ZN6marian14SyncGraphGroup6updateESt6vectorISt10shared_ptrINS_4data5BatchEESaIS5_EEm+0x10fc) [0x560cdca96cdc]
./marian(_ZN6marian14SyncGraphGroup6updateESt10shared_ptrINS_4data5BatchEE+0x37a) [0x560cdca9724a]
./marian(_ZN6marian5TrainINS_14SyncGraphGroupEE3runEv+0x8fc) [0x560cdc79925c]
./marian(_Z11mainTraineriPPc+0xc9) [0x560cdc6cc729]
./marian(main+0x35) [0x560cdc6aafa5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f698e33bb97]
./marian(_start+0x2a) [0x560cdc6c88ca]

Any suggestions?

Cheers,

Nick

mshiryaev commented 3 years ago

@XapaJIaMnu - the "not completed event is destroyed" error may be caused by the missing ".wait()" call on the event returned by ccl::allgatherv here: https://github.com/XapaJIaMnu/marian-dev/blob/d33cea1d649186242c244f6a11d599be68f3499c/src/training/communicator_oneccl.h#L226

Allgatherv supports using the same buffer for both send and receive, as MPI does, but there is no special keyword like MPI_IN_PLACE. The recv buffer should already contain the current rank's send data at the corresponding offset, and that same recv buffer should be passed twice to allgatherv (as both the send and the receive buffer), e.g. like here: https://github.com/oneapi-src/oneCCL/blob/master/examples/sycl/sycl_allgatherv_inplace_test.cpp#L116
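To make the "corresponding offset" concrete, here is a minimal host-side sketch of how the receive buffer could be prepared before an in-place call. It does not link against oneCCL: `block_offset` and `stage_in_place` are hypothetical helper names introduced only for illustration, and the oneCCL call appears only as a comment, under the assumption that the host-buffer overload of ccl::allgatherv takes (send_buf, send_count, recv_buf, recv_counts, comm) and returns an event.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Element offset of `rank`'s block in the gathered buffer: the sum of the
// counts contributed by all lower-numbered ranks.
std::size_t block_offset(const std::vector<std::size_t>& recv_counts, std::size_t rank) {
    assert(rank < recv_counts.size());
    return std::accumulate(recv_counts.begin(),
                           recv_counts.begin() + static_cast<std::ptrdiff_t>(rank),
                           std::size_t{0});
}

// Hypothetical helper: size recv_buf for the full gathered result and copy
// this rank's local data into its own slot, so no separate send buffer is needed.
void stage_in_place(std::vector<float>& recv_buf,
                    const std::vector<float>& local,
                    const std::vector<std::size_t>& recv_counts,
                    std::size_t rank) {
    assert(rank < recv_counts.size());
    assert(local.size() == recv_counts[rank]);
    const std::size_t total =
        std::accumulate(recv_counts.begin(), recv_counts.end(), std::size_t{0});
    recv_buf.resize(total);
    std::copy(local.begin(), local.end(),
              recv_buf.begin() + static_cast<std::ptrdiff_t>(block_offset(recv_counts, rank)));
}

// After staging, the in-place call described above would pass the same buffer
// twice and wait on the returned event (assumed signature; `comm` is a
// ccl::communicator created elsewhere):
//
//   ccl::allgatherv(recv_buf.data(), recv_counts[rank],
//                   recv_buf.data(), recv_counts, comm).wait();
```

For example, with recv_counts = {2, 3, 1}, rank 1 would place its three elements at indices 2..4 of a six-element receive buffer before making the call.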

XapaJIaMnu commented 3 years ago

@mshiryaev thank you very much, this was entirely user error. Adding the wait call removed the error message, and passing recvbuf instead of the partially overlapping sendbuf fixed the crash. NCCL works correctly with partially overlapping buffers, and the mistake crept in when porting from there.