bzamecnik opened this issue 7 years ago
First, note that there are at least two different strategies for data-parallel training:

1. Put a copy of all weights on each GPU and apply gradients to every copy.
2. Distribute the single copy of weights across the GPUs (e.g. by load balancing), so that each GPU holds only a subset of the weights.
In my experience NCCL is helpful for (1) but not for (2), because NCCL all_sum ends up making the sum of the gradients available on all GPUs, whereas in (2) you only need it on one GPU (the one that owns the weight).
Which strategy is faster depends on the characteristics of the model. For example, I found (2) slightly better for ResNet on ImageNet, but (1) better for FasterRCNN on COCO. The sketch below illustrates the difference.
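To make the distinction concrete, here is a minimal sketch of strategy (2) in TF 1.x graph mode: each GPU owns a disjoint subset of the weights, and the gradient sum for a weight is materialized only on its owner's device with a plain `tf.add_n`. The variable layout, toy loss, and learning rate are hypothetical placeholders, not code from this project.

```python
import tensorflow as tf

num_gpus = 2

# Strategy (2): weight i "lives" only on GPU i; there is no replica.
weights = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        weights.append(tf.get_variable('w%d' % i, shape=[10],
                                       initializer=tf.zeros_initializer()))

# Every tower computes gradients w.r.t. all weights on its own data shard...
tower_grads = []  # tower_grads[t][v] = gradient of tower t w.r.t. weight v
for t in range(num_gpus):
    with tf.device('/gpu:%d' % t):
        x = tf.random_normal([10])  # stand-in for a per-tower input batch
        loss = tf.add_n([tf.reduce_sum(tf.square(x - w)) for w in weights])
        tower_grads.append(tf.gradients(loss, weights))

# ...but the summed gradient is only needed on the GPU that owns the weight,
# so an add_n placed on the owner's device is enough; an all_sum that leaves
# the result on every GPU would do redundant communication here.
train_ops = []
for v, w in enumerate(weights):
    with tf.device(w.device):
        grad_sum = tf.add_n([tower_grads[t][v] for t in range(num_gpus)])
        train_ops.append(tf.assign_sub(w, 0.01 * grad_sum / num_gpus))
```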
NCCL claims to provide optimized collective operations for multi-GPU communication, and it's available via TensorFlow as well. In our case we could use the TF NCCL operation `tf.contrib.nccl.all_sum`. It's an all-reduce with sum reduction, i.e. a reduce followed by a broadcast of the result. We can use that for gradient averaging: the summed gradients end up available on all devices, so the weights can be located and updated on all devices and do not need to be broadcast.
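A minimal sketch of this gradient-averaging pattern, assuming TF 1.x with a working NCCL build; the per-tower variable, toy loss, and tower count are hypothetical placeholders:

```python
import tensorflow as tf

num_gpus = 2
tower_grads = []  # one gradient tensor per GPU for the replicated weight

# Strategy (1): each GPU holds a full replica of the weights.
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.variable_scope('tower_%d' % i):
        w = tf.get_variable('w', shape=[10], initializer=tf.zeros_initializer())
        x = tf.random_normal([10])  # stand-in for a per-tower input batch
        loss = tf.reduce_sum(tf.square(x - w))
        tower_grads.append(tf.gradients(loss, w)[0])

# all_sum returns one tensor per input, each holding the sum of all tower
# gradients and placed on the same device as the corresponding input, so
# every replica can apply the averaged gradient locally without a broadcast.
summed_grads = tf.contrib.nccl.all_sum(tower_grads)
avg_grads = [g / num_gpus for g in summed_grads]
```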
An all-scatter operation is not provided in `tf.contrib.nccl`. Instead we could utilize the TF queue mechanism.
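As a rough sketch of that idea (not a drop-in all-scatter), a `tf.FIFOQueue` can hand a tensor produced on one device to consumers on other devices; the queue ops themselves may be placed on the CPU, so soft placement is enabled. The shapes and device names here are hypothetical:

```python
import tensorflow as tf

num_gpus = 2

with tf.device('/gpu:0'):
    result = tf.random_normal([4])  # stand-in for a tensor reduced onto GPU 0

# One single-slot queue per consumer GPU; the producer enqueues one copy each.
queues = [tf.FIFOQueue(capacity=1, dtypes=[tf.float32], shapes=[[4]])
          for _ in range(1, num_gpus)]
enqueue_ops = [q.enqueue(result) for q in queues]

received = []
for i, q in enumerate(queues, start=1):
    with tf.device('/gpu:%d' % i):
        received.append(q.dequeue())  # TF inserts the cross-device copy

config = tf.ConfigProto(allow_soft_placement=True)  # queue ops fall back to CPU
with tf.Session(config=config) as sess:
    sess.run(enqueue_ops)
    print(sess.run(received))
```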