Throughput (GB/s) with unmodified cuDF 0.14, with 100 million rows/table/GPU, on a single-node DGX-1
#GPUs | UCX default | UCX buffer communicator | NCCL |
---|---|---|---|
4 | 52.03 | 46.55 | 18.03 |
8 | 8.40 | 52.24 | 30.51 |
To improve NCCL's performance, we use 128-bit aligned communication buffers in https://github.com/rapidsai/distributed-join/pull/27/commits/297289bc84dd1575d4dab89ee1d8cf708e1e56fe.
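As a rough illustration of the idea (the helper below is hypothetical, not the code from that commit), 128-bit alignment amounts to rounding each partition's byte count up to a multiple of 16, so that every buffer a rank sends or receives starts on an aligned address:

```cpp
#include <cstddef>

// Hypothetical helper: round a byte count up to the next 128-bit (16-byte)
// boundary so each partition's communication buffer is aligned.
constexpr std::size_t kAlignment = 16;  // 128 bits

constexpr std::size_t align_up(std::size_t nbytes) {
  return (nbytes + kAlignment - 1) / kAlignment * kAlignment;
}

static_assert(align_up(1) == 16 && align_up(16) == 16 && align_up(17) == 32);
```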
Throughput (GB/s) with unmodified cuDF 0.15, with 100 million rows/table/GPU, on DGX-1
#GPUs | UCX default | UCX buffer communicator | NCCL |
---|---|---|---|
4 | 52.67 | 50.39 | 46.21 |
8 | 7.38 | 51.00 | 45.96 |
32 | N/A | 86.79 | 78.14 |
Throughput (GB/s) with a "hacky" cuDF using an identity hash function, with 100 million rows/table/GPU, on DGX-1
#GPUs | UCX default | UCX buffer communicator | NCCL |
---|---|---|---|
4 | 55.65 | 51.55 | 51.32 |
32 | N/A | 150.41 | 130.93 |
I have noticed that increasing `NCCL_BUFFSIZE` to 16777216 can yield slightly better performance, up to 133.96 GB/s on 32 GPUs. In comparison, the GTC number on 32 GPUs is ~146 GB/s.
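For reference, `NCCL_BUFFSIZE` (16777216 bytes = 16 MiB, versus NCCL's 4 MiB default) is read from the environment when the NCCL communicator is created, so it must be set before initialization. A minimal sketch, assuming the benchmark initializes NCCL itself; setting the variable in the launching shell works equally well:

```cpp
#include <cstdlib>

int main() {
  // Must happen before NCCL initializes (e.g. before ncclCommInitRank).
  // Equivalent to launching with: NCCL_BUFFSIZE=16777216 ./benchmark
  setenv("NCCL_BUFFSIZE", "16777216", /*overwrite=*/1);
  // ... initialize NCCL and run the distributed join benchmark ...
  return 0;
}
```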
This PR intends to

* Add `start` and `stop` to the `Communicator` class so that it is compatible with NCCL's group calls, and move the existing communicators onto the `start`/`stop` API (see the sketch after this list).
* Remove `recv` from the `Communicator` API and the distributed join algorithm.
* Implement the `Communicator` interface on top of NCCL in `NCCLCommunicator`.

This PR is built on top of #24.
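For illustration, here is a minimal C++ sketch of how a `start`/`stop` pair can wrap NCCL's group calls (`ncclGroupStart`/`ncclGroupEnd`). Only the class and method names follow the PR description; the bodies and signatures are assumptions, not the repository's actual implementation. In particular, the `recv` shown here takes an explicit count, since NCCL point-to-point requires the receiver to know the message size.

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch of a group-call-friendly communicator interface (illustrative only).
class Communicator {
 public:
  virtual void start() = 0;  // open a batch of point-to-point operations
  virtual void stop() = 0;   // submit the batch and wait for completion
  virtual void send(const void* buf, std::size_t nbytes, int peer) = 0;
  virtual void recv(void* buf, std::size_t nbytes, int peer) = 0;
  virtual ~Communicator() = default;
};

class NCCLCommunicator : public Communicator {
 public:
  NCCLCommunicator(ncclComm_t comm, cudaStream_t stream)
      : comm_(comm), stream_(stream) {}

  // Every send/recv issued between start() and stop() joins a single NCCL
  // group, which lets NCCL service the whole batch concurrently instead of
  // serializing (and potentially deadlocking on) individual calls.
  void start() override { ncclGroupStart(); }
  void stop() override {
    ncclGroupEnd();                  // launch all queued sends/recvs
    cudaStreamSynchronize(stream_);  // block until the batch completes
  }

  void send(const void* buf, std::size_t nbytes, int peer) override {
    ncclSend(buf, nbytes, ncclChar, peer, comm_, stream_);
  }
  void recv(void* buf, std::size_t nbytes, int peer) override {
    ncclRecv(buf, nbytes, ncclChar, peer, comm_, stream_);
  }

 private:
  ncclComm_t comm_;
  cudaStream_t stream_;
};
```

In this shape, an all-to-all exchange becomes one `start()`, a loop of `send`/`recv` over all peers, then one `stop()`, which matches how NCCL expects grouped point-to-point traffic to be issued.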