acgtyrant opened 6 years ago
Just letting you know that a PyTorch-compatible Synchronized Batch Norm is provided here: http://hangzh.com/PyTorch-Encoding/index.html See the example here.
@zhanghang1989 Does this support torch.nn.parallel.DistributedDataParallel?
@zhanghang1989 Excuse me, I only see that you use DataParallel, not DistributedDataParallel. If you are sure it works, I will try DistributedDataParallel by myself later.
BTW, the Python notebook link is 404.
@zhanghang1989 Hi Hang. Thanks for the introduction. This repo aims at providing a standalone and easy-to-use version of sync_bn so that it can be easily integrated into any existing framework. Its implementation also differs from your previous implementation.
For an example use of the sync_bn, please check: https://github.com/CSAILVision/semantic-segmentation-pytorch
As for DistributedDataParallel, I currently have no plan to support it. @zhanghang1989 Have you tested your implementation in the distributed setting?
Thanks @vacancy for the introduction! Nice work. I do plan to support distributed training, since a recent paper uses it for object detection. @acgtyrant The only thing that needs to be considered for the distributed case is making sure the number of GPUs is set correctly: https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/nn/syncbn.py#L196
@zhanghang1989 I am not an expert at this, but it seems to me that DistributedDataParallel uses a different implementation from DataParallel for broadcasting and reduction. In a single-process, multi-thread setting (DataParallel), one can use NCCL's simple broadcast and reduce; for tensors shared across multiple processes or even multiple machines, we need a special implementation, which is defined in the torch.distributed package.
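To make the distinction concrete, here is a minimal sketch (not taken from either repo; the tensor names are made up for illustration) contrasting the two communication styles:

```python
import torch
import torch.cuda.comm as comm
import torch.distributed as dist

# Single-process, multi-GPU (the DataParallel setting): sum per-GPU statistics
# onto device 0, then broadcast the result back to every device.
devices = list(range(torch.cuda.device_count()))
stats = [torch.randn(64, device=f'cuda:{d}') for d in devices]
total = comm.reduce_add(stats, destination=devices[0])
copies = comm.broadcast(total, devices)

# Multi-process / multi-machine (the DistributedDataParallel setting): each
# process holds one tensor; all_reduce sums it across all processes in place.
# Assumes dist.init_process_group(...) has already been called in every process.
local_stats = torch.randn(64, device='cuda')
dist.all_reduce(local_stats, op=dist.ReduceOp.SUM)
```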
@vacancy Thanks for the information.
I read the source code of DistributedDataParallel and found that it does not broadcast the parameters the way DataParallel does; it only all_reduces the gradients, so that all model replicas use the same gradients to optimize the same model, and all model replicas use the same buffers, which are broadcast from the model on device 0 of the rank-0 node.
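Roughly, stripped of bucketing and of overlapping communication with the backward pass, the gradient synchronization DistributedDataParallel performs is equivalent to something like the sketch below (assuming the process group is already initialized):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce every parameter's gradient and average it across processes."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```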
@zhanghang1989 I think your syncbn uses only comm.broadcast_coalesced and comm.reduce_add_coalesced, which only support cross-GPU communication within a single node; they do not support the cross-node (distributed) case.
Hi @acgtyrant
What exactly do you mean by "SynchronizedBatchNorm2d is not numerical stable only"?
My tests show that my implementation of SynchronizedBatchNorm2d is not as numerically stable as your bn; there is no other error. So, since your bn works, my bn should work too. But it does not, and I do not know why... So I need some help now.
After fixing the wrong view of the output, I am retraining DRN now, and it seems to work as expected.
@acgtyrant Is DistributedDataParallel working for you? Are you planning to send a Pull Request?
Thank you all for this beautiful work!
My distributed synced bn works with DistributedDataParallel, but because of confidentiality rules from my boss, I have deleted the post that contained the source of the distributed synced bn, sorry.
However, it is easy to implement: torch.distributed.all_reduce is synchronized automatically, so just wrap it as an autograd function, use it to all-reduce the sum and sum-of-squares, and then use that function in your synced bn module.
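A minimal sketch of that idea (class and variable names are mine, not from any particular repo): wrap torch.distributed.all_reduce in an autograd Function so the summed statistics stay differentiable, then feed the per-process sum and sum-of-squares through it.

```python
import torch
import torch.distributed as dist

class AllReduceSum(torch.autograd.Function):
    """Differentiable all-reduce (sum) across all processes."""

    @staticmethod
    def forward(ctx, tensor):
        tensor = tensor.clone()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor

    @staticmethod
    def backward(ctx, grad_output):
        # The output on every rank depends on the input of every rank,
        # so the incoming gradients also have to be summed across ranks.
        grad = grad_output.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad

# Inside the sync-bn forward pass (x is NCHW), roughly:
#   sum_x  = AllReduceSum.apply(x.sum(dim=[0, 2, 3]))
#   sum_x2 = AllReduceSum.apply((x * x).sum(dim=[0, 2, 3]))
#   count  = x.size(0) * x.size(2) * x.size(3) * dist.get_world_size()
#   mean   = sum_x / count
#   var    = sum_x2 / count - mean * mean
```

Note that the count above assumes every process sees the same per-GPU batch size; with unbalanced batches the count would need to be all-reduced as well.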
FYI, there is an open-source repo created by NVIDIA, https://github.com/NVIDIA/apex, which supports SyncBN with DistributedDataParallel.
In fact, as of now, it only supports SyncBN with DistributedDataParallel and does not support SyncBN with DataParallel; see this issue: https://github.com/NVIDIA/apex/issues/115
How about using torch.nn.parallel.DistributedDataParallel but only running on one node with 8 gpus? Does this repo work?
For torch.nn.parallel.DistributedDataParallel, please use torch.nn.SyncBatchNorm.
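For reference, a typical conversion looks like this (a sketch assuming one process per GPU launched via the usual env:// initialization; the torchvision ResNet is just a stand-in model):

```python
import torch
import torch.distributed as dist
import torchvision

dist.init_process_group(backend='nccl')
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torchvision.models.resnet50().cuda()
# Replace every BatchNorm layer with torch.nn.SyncBatchNorm before wrapping in DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```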
Currently not.
The implementation is designed for multi-GPU BatchNorm, which is commonly used for Computer Vision tasks. Thus, it uses NCCL for multi-GPU broadcasting and reduction.
A distributed version needs other types of synchronization primitives (e.g., shared memory or cross-machine synchronization).
Contributions will be highly appreciated!