rossumai / keras-multi-gpu

Multi-GPU data-parallel training in Keras

Possible to use NCCL for optimized inter-GPU communication? #3

Open bzamecnik opened 7 years ago

bzamecnik commented 7 years ago

NCCL claims to provide optimized collective operations for multi-GPU communication, and it is available through TensorFlow as well.

In our case we could use the TF NCCL operation tf.contrib.nccl.all_sum. It is an all-reduce with sum reduction, i.e. a reduce followed by a broadcast of the result, which we can use for gradient averaging. Since the summed gradients end up available on all devices, each device can keep and update its own copy of the weights, so the weights never need to be broadcast.
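For example, a minimal TF 1.x sketch of this idea (the toy model, shapes and learning rate are only illustrative, not a tested implementation):

```python
import tensorflow as tf
from tensorflow.contrib import nccl

num_gpus = 2
lr = 0.01

tower_grads = []  # one gradient tensor per GPU
tower_vars = []   # per-GPU replica of the weights

for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.variable_scope('tower_%d' % i):
        # toy "model": a single weight matrix and a dummy loss
        x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
        w = tf.get_variable('w', shape=[784, 10])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_grads.append(tf.gradients(loss, [w])[0])
        tower_vars.append(w)

# all_sum takes one tensor per GPU and leaves the elementwise sum
# on every GPU, so each replica can update its own weight copy locally.
summed_grads = nccl.all_sum(tower_grads)

train_ops = []
for i, (g_sum, w) in enumerate(zip(summed_grads, tower_vars)):
    with tf.device('/gpu:%d' % i):
        avg_grad = g_sum / num_gpus          # average, not just sum
        train_ops.append(tf.assign_sub(w, lr * avg_grad))

train_op = tf.group(*train_ops)
```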

An all-scatter operation is not provided in tf.contrib.nccl; instead we could use the TF queue mechanism to distribute the input sub-batches to the GPUs.
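A rough sketch of how such a scatter could look with plain tf.FIFOQueue (the capacity, shapes and the stand-in per-GPU op are assumptions, not a tested implementation):

```python
import tensorflow as tf

num_gpus = 2
batch_shape = [32, 784]  # illustrative sub-batch shape

placeholders, enqueue_ops, towers = [], [], []
for i in range(num_gpus):
    with tf.device('/cpu:0'):
        # one CPU-side queue per GPU, fed with that GPU's sub-batch
        q = tf.FIFOQueue(capacity=4, dtypes=[tf.float32], shapes=[batch_shape])
        sub_batch = tf.placeholder(tf.float32, shape=batch_shape)
        placeholders.append(sub_batch)
        enqueue_ops.append(q.enqueue(sub_batch))
    with tf.device('/gpu:%d' % i):
        # the dequeued tensor is copied to the GPU when the tower consumes it
        x_i = q.dequeue()
        towers.append(tf.reduce_sum(x_i))  # stand-in for the per-GPU model
```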

ppwwyyxx commented 7 years ago

First, note that there are at least two different strategies for data-parallel training:

1. Put a copy of all the weights on each GPU and apply the gradients to every copy.
2. Distribute a single copy of the weights across all GPUs (by load balancing, etc.), so that each GPU holds only a subset of the weights.
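To illustrate (2), a minimal sketch of round-robin weight placement; the device-chooser class and the layer shapes are made up for illustration:

```python
import tensorflow as tf

num_gpus = 2

class RoundRobinDeviceSetter(object):
    """Places each newly created variable on the next GPU in turn."""
    def __init__(self, num_devices):
        self.num_devices = num_devices
        self.counter = 0

    def __call__(self, op):
        if op.type in ('Variable', 'VariableV2', 'VarHandleOp'):
            device = '/gpu:%d' % (self.counter % self.num_devices)
            self.counter += 1
            return device
        # leave all other ops to the default placer
        return op.device

with tf.device(RoundRobinDeviceSetter(num_gpus)):
    w1 = tf.get_variable('w1', shape=[784, 256])  # lands on gpu:0
    w2 = tf.get_variable('w2', shape=[256, 10])   # lands on gpu:1
```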

In my experience NCCL is helpful for (1) but not for (2), because NCCL all_sum ends up with the sum of the gradients available on all GPUs, while in (2) you only need it on one GPU (the one which owns the weight).
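For example, in (2) a plain tf.add_n placed on the owning GPU is sufficient; the gradient placeholders and shapes below are only illustrative:

```python
import tensorflow as tf

num_gpus = 2

# per-GPU gradients for one weight shard (stand-ins for real tower gradients)
tower_grads_for_w = [
    tf.placeholder(tf.float32, shape=[784, 10], name='grad_gpu%d' % i)
    for i in range(num_gpus)
]

owner_device = '/gpu:0'  # the GPU that holds this weight shard
with tf.device(owner_device):
    # sum (and average) only where the weight lives; no broadcast needed
    avg_grad = tf.add_n(tower_grads_for_w) / num_gpus
```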

Which strategy is faster depends on the characteristics of the model. For example, I found that (2) is slightly better for ResNet on ImageNet, while (1) is better for FasterRCNN on COCO.