rapidsai / distributed-join

Add NCCL Communicator #27

Closed gaohao95 closed 4 years ago

gaohao95 commented 4 years ago

This PR intends to add an NCCL-based communicator.

This PR is built on top of #24.
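
For context, here is a hedged sketch of how an all-to-all exchange of device buffers can be expressed with NCCL point-to-point primitives (`ncclSend`/`ncclRecv` inside a `ncclGroupStart`/`ncclGroupEnd` group, available since NCCL 2.7). The function and argument names below are illustrative and are not the repository's actual communicator interface:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

// Illustrative all-to-all: rank i sends send_bufs[j] to rank j and receives
// recv_bufs[j] from rank j. Counts are in bytes (ncclInt8 elements).
void nccl_all_to_all(const std::vector<const void*>& send_bufs,
                     const std::vector<std::size_t>& send_counts,
                     const std::vector<void*>& recv_bufs,
                     const std::vector<std::size_t>& recv_counts,
                     int nranks,
                     ncclComm_t comm,
                     cudaStream_t stream)
{
  // Group the point-to-point calls so NCCL can schedule them together
  // instead of serializing them rank by rank.
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    ncclSend(send_bufs[peer], send_counts[peer], ncclInt8, peer, comm, stream);
    ncclRecv(recv_bufs[peer], recv_counts[peer], ncclInt8, peer, comm, stream);
  }
  ncclGroupEnd();
}
```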

gaohao95 commented 4 years ago

Throughput (GB/s) with unmodified cuDF 0.14, 100 million rows per table per GPU, on a single-node DGX-1:

| # GPUs | UCX default | UCX buffer communicator | NCCL |
|-------:|------------:|-------------------------:|------:|
| 4      | 52.03       | 46.55                    | 18.03 |
| 8      | 8.40        | 52.24                    | 30.51 |

gaohao95 commented 4 years ago

To improve NCCL's performance, we use a 128-bit aligned communication buffer in https://github.com/rapidsai/distributed-join/pull/27/commits/297289bc84dd1575d4dab89ee1d8cf708e1e56fe.
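
As a rough illustration of the idea behind that commit, per-partition buffer sizes/offsets can be rounded up to a 16-byte (128-bit) boundary before the NCCL calls. The constant and helper below are a hedged sketch, not the repository's actual code:

```cpp
#include <cstddef>

// Assumption for illustration: align communication buffers to 16 bytes (128 bits).
constexpr std::size_t kAlignmentBytes = 16;

// Round a byte count up to the next multiple of the alignment,
// e.g. align_up(13) == 16, align_up(32) == 32.
constexpr std::size_t align_up(std::size_t nbytes)
{
  return (nbytes + kAlignmentBytes - 1) / kAlignmentBytes * kAlignmentBytes;
}
```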

Throughput (GB/s) with unmodified cuDF 0.15, 100 million rows per table per GPU, on DGX-1:

| # GPUs | UCX default | UCX buffer communicator | NCCL |
|-------:|------------:|-------------------------:|------:|
| 4      | 52.67       | 50.39                    | 46.21 |
| 8      | 7.38        | 51.00                    | 45.96 |
| 32     | N/A         | 86.79                    | 78.14 |

gaohao95 commented 4 years ago

Throughput (GB/s) with a "hacky" cuDF using an identity hash function, 100 million rows per table per GPU, on DGX-1:

| # GPUs | UCX default | UCX buffer communicator | NCCL  |
|-------:|------------:|-------------------------:|-------:|
| 4      | 55.65       | 51.55                    | 51.32  |
| 32     | N/A         | 150.41                   | 130.93 |

I have noticed that increasing NCCL_BUFFSIZE to 16777216 yields slightly better performance, up to 133.96 GB/s on 32 GPUs. In comparison, the GTC number on 32 GPUs is ~146 GB/s.
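
For reference, a minimal sketch of setting NCCL_BUFFSIZE programmatically. NCCL reads the variable when the communicator is initialized, so exporting it in the job's launch environment works just as well:

```cpp
#include <cstdlib>

int main()
{
  // Assumption for illustration: this runs before ncclCommInitRank, since NCCL
  // reads NCCL_BUFFSIZE at communicator initialization. 16777216 bytes = 16 MiB.
  setenv("NCCL_BUFFSIZE", "16777216", /*overwrite=*/1);

  // ... initialize MPI/NCCL and run the distributed-join benchmark as usual ...
  return 0;
}
```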