bzamecnik opened this issue 7 years ago
First, note that there are at least two different strategies for data-parallel training:

1. Put a copy of all weights on each GPU and apply gradients to every copy.
2. Distribute the single copy of weights across the GPUs (e.g. by load balancing), so that each GPU holds only a subset of the weights.
In my experience NCCL is helpful for (1) but not for (2), because NCCL all_sum ends up making the sum of the gradients available on all GPUs, whereas in (2) you only need it on one GPU (the one that owns the weight).
Which strategy is faster depends on the characteristics of the model. For example, I found (2) slightly better for ResNet on ImageNet, but (1) better for FasterRCNN on COCO. The sketch below illustrates the difference.
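To make the distinction concrete, here is a minimal sketch of strategy (2) in TF 1.x graph mode: each GPU owns a disjoint subset of the weights, and the gradient sum for a weight is materialized only on its owner's device with a plain `tf.add_n`. The variable layout, toy loss, and learning rate are hypothetical placeholders, not code from this project.

```python
import tensorflow as tf

num_gpus = 2

# Strategy (2): weight i "lives" only on GPU i; there is no replica.
weights = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        weights.append(tf.get_variable('w%d' % i, shape=[10],
                                       initializer=tf.zeros_initializer()))

# Every tower computes gradients w.r.t. all weights on its own data shard...
tower_grads = []  # tower_grads[t][v] = gradient of tower t w.r.t. weight v
for t in range(num_gpus):
    with tf.device('/gpu:%d' % t):
        x = tf.random_normal([10])  # stand-in for a per-tower input batch
        loss = tf.add_n([tf.reduce_sum(tf.square(x - w)) for w in weights])
        tower_grads.append(tf.gradients(loss, weights))

# ...but the summed gradient is only needed on the GPU that owns the weight,
# so an add_n placed on the owner's device is enough; an all_sum that leaves
# the result on every GPU would do redundant communication here.
train_ops = []
for v, w in enumerate(weights):
    with tf.device(w.device):
        grad_sum = tf.add_n([tower_grads[t][v] for t in range(num_gpus)])
        train_ops.append(tf.assign_sub(w, 0.01 * grad_sum / num_gpus))
```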
NCCL claims to provide optimized collective operations for multi-GPU communication, and it's available via TensorFlow as well. In our case we could use the TF NCCL operation `tf.contrib.nccl.all_sum`. It's an all-reduce with sum reduction, i.e. a reduce followed by a broadcast of the result. We can use that for gradient averaging: the summed gradients end up available on all devices, so the weights can be located and updated on all devices and do not need to be broadcast.
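A minimal sketch of this gradient-averaging pattern, assuming TF 1.x with a working NCCL build; the per-tower variable, toy loss, and tower count are hypothetical placeholders:

```python
import tensorflow as tf

num_gpus = 2
tower_grads = []  # one gradient tensor per GPU for the replicated weight

# Strategy (1): each GPU holds a full replica of the weights.
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.variable_scope('tower_%d' % i):
        w = tf.get_variable('w', shape=[10], initializer=tf.zeros_initializer())
        x = tf.random_normal([10])  # stand-in for a per-tower input batch
        loss = tf.reduce_sum(tf.square(x - w))
        tower_grads.append(tf.gradients(loss, w)[0])

# all_sum returns one tensor per input, each holding the sum of all tower
# gradients and placed on the same device as the corresponding input, so
# every replica can apply the averaged gradient locally without a broadcast.
summed_grads = tf.contrib.nccl.all_sum(tower_grads)
avg_grads = [g / num_gpus for g in summed_grads]
```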
An all-scatter operation is not provided in `tf.contrib.nccl`. Instead we could utilize the TF queue mechanism.
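As a rough sketch of that idea (not a drop-in all-scatter), a `tf.FIFOQueue` can hand a tensor produced on one device to consumers on other devices; the queue ops themselves may be placed on the CPU, so soft placement is enabled. The shapes and device names here are hypothetical:

```python
import tensorflow as tf

num_gpus = 2

with tf.device('/gpu:0'):
    result = tf.random_normal([4])  # stand-in for a tensor reduced onto GPU 0

# One single-slot queue per consumer GPU; the producer enqueues one copy each.
queues = [tf.FIFOQueue(capacity=1, dtypes=[tf.float32], shapes=[[4]])
          for _ in range(1, num_gpus)]
enqueue_ops = [q.enqueue(result) for q in queues]

received = []
for i, q in enumerate(queues, start=1):
    with tf.device('/gpu:%d' % i):
        received.append(q.dequeue())  # TF inserts the cross-device copy

config = tf.ConfigProto(allow_soft_placement=True)  # queue ops fall back to CPU
with tf.Session(config=config) as sess:
    sess.run(enqueue_ops)
    print(sess.run(received))
```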