Throughput (GB/s) with unmodified cuDF 0.14, with 100 million rows/table/GPU, on a single-node DGX-1
#GPUs | UCX default | UCX buffer communicator | NCCL |
---|---|---|---|
4 | 52.03 | 46.55 | 18.03 |
8 | 8.40 | 52.24 | 30.51 |
To improve NCCL's performance, we use 128-bit aligned communication buffers in https://github.com/rapidsai/distributed-join/pull/27/commits/297289bc84dd1575d4dab89ee1d8cf708e1e56fe.
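As a rough illustration of the idea (the helper below is hypothetical, not the code from that commit), 128-bit alignment amounts to rounding each partition's byte count up to a multiple of 16, so that every buffer a rank sends or receives starts on an aligned address:

```cpp
#include <cstddef>

// Hypothetical helper: round a byte count up to the next 128-bit (16-byte)
// boundary so each partition's communication buffer is aligned.
constexpr std::size_t kAlignment = 16;  // 128 bits

constexpr std::size_t align_up(std::size_t nbytes) {
  return (nbytes + kAlignment - 1) / kAlignment * kAlignment;
}

static_assert(align_up(1) == 16 && align_up(16) == 16 && align_up(17) == 32);
```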
Throughput (GB/s) with unmodified cuDF 0.15, with 100 million rows/table/GPU, on DGX-1
#GPUs | UCX default | UCX buffer communicator | NCCL |
---|---|---|---|
4 | 52.67 | 50.39 | 46.21 |
8 | 7.38 | 51.00 | 45.96 |
32 | N/A | 86.79 | 78.14 |
Throughput (GB/s) with a "hacky" cuDF using an identity hash function, with 100 million rows/table/GPU, on DGX-1
#GPUs | UCX default | UCX buffer communicator | NCCL |
---|---|---|---|
4 | 55.65 | 51.55 | 51.32 |
32 | N/A | 150.41 | 130.93 |
I have noticed that increasing `NCCL_BUFFSIZE` to 16777216 can yield slightly better performance, up to 133.96 GB/s on 32 GPUs. In comparison, the GTC number on 32 GPUs is ~146 GB/s.
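For reference, `NCCL_BUFFSIZE` (16777216 bytes = 16 MiB, versus NCCL's 4 MiB default) is read from the environment when the NCCL communicator is created, so it must be set before initialization. A minimal sketch, assuming the benchmark initializes NCCL itself; setting the variable in the launching shell works equally well:

```cpp
#include <cstdlib>

int main() {
  // Must happen before NCCL initializes (e.g. before ncclCommInitRank).
  // Equivalent to launching with: NCCL_BUFFSIZE=16777216 ./benchmark
  setenv("NCCL_BUFFSIZE", "16777216", /*overwrite=*/1);
  // ... initialize NCCL and run the distributed join benchmark ...
  return 0;
}
```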
This PR intends to

* Add `start` and `stop` to the `Communicator` class so that it is compatible with NCCL's group calls, and move the existing communicators onto the `start`/`stop` API (see the sketch after this list).
* Remove `recv` from the `Communicator` API and the distributed join algorithm.
* Implement the `Communicator` interface on top of NCCL in `NCCLCommunicator`.

This PR is built on top of #24.
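For illustration, here is a minimal C++ sketch of how a `start`/`stop` pair can wrap NCCL's group calls (`ncclGroupStart`/`ncclGroupEnd`). Only the class and method names follow the PR description; the bodies and signatures are assumptions, not the repository's actual implementation. In particular, the `recv` shown here takes an explicit count, since NCCL point-to-point requires the receiver to know the message size.

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch of a group-call-friendly communicator interface (illustrative only).
class Communicator {
 public:
  virtual void start() = 0;  // open a batch of point-to-point operations
  virtual void stop() = 0;   // submit the batch and wait for completion
  virtual void send(const void* buf, std::size_t nbytes, int peer) = 0;
  virtual void recv(void* buf, std::size_t nbytes, int peer) = 0;
  virtual ~Communicator() = default;
};

class NCCLCommunicator : public Communicator {
 public:
  NCCLCommunicator(ncclComm_t comm, cudaStream_t stream)
      : comm_(comm), stream_(stream) {}

  // Every send/recv issued between start() and stop() joins a single NCCL
  // group, which lets NCCL service the whole batch concurrently instead of
  // serializing (and potentially deadlocking on) individual calls.
  void start() override { ncclGroupStart(); }
  void stop() override {
    ncclGroupEnd();                  // launch all queued sends/recvs
    cudaStreamSynchronize(stream_);  // block until the batch completes
  }

  void send(const void* buf, std::size_t nbytes, int peer) override {
    ncclSend(buf, nbytes, ncclChar, peer, comm_, stream_);
  }
  void recv(void* buf, std::size_t nbytes, int peer) override {
    ncclRecv(buf, nbytes, ncclChar, peer, comm_, stream_);
  }

 private:
  ncclComm_t comm_;
  cudaStream_t stream_;
};
```

In this shape, an all-to-all exchange becomes one `start()`, a loop of `send`/`recv` over all peers, then one `stop()`, which matches how NCCL expects grouped point-to-point traffic to be issued.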