acgtyrant opened 6 years ago
Just letting you know that a PyTorch-compatible Synchronized Batch Norm is provided here: http://hangzh.com/PyTorch-Encoding/index.html See the example here.
@zhanghang1989 Does this support torch.nn.parallel.DistributedDataParallel?
@zhanghang1989 Excuse me, I only see that you use DataParallel, not DistributedDataParallel. If you are sure it works, I will try DistributedDataParallel by myself later.
BTW, the Python notebook link is 404.
@zhanghang1989 Hi Hang. Thanks for the introduction. This repo aims at providing a standalone and easy-to-use version of sync_bn so that it can be easily integrated into any existing framework. Its implementation also differs from your previous implementation.
For an example use of the sync_bn, please check: https://github.com/CSAILVision/semantic-segmentation-pytorch
As for DistributedDataParallel, I currently have no plan to support it. @zhanghang1989 Have you tested your implementation in the distributed setting?
Thanks @vacancy for the introduction! Nice work. I do plan to support distributed training, since a recent paper uses it for object detection. @acgtyrant The only thing that needs to be considered for the distributed case is making sure the number of GPUs is set correctly: https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/nn/syncbn.py#L196
@zhanghang1989 I am not an expert at this, but it seems to me that DistributedDataParallel uses a different implementation from DataParallel for broadcasting and reduction. In a single-process, multi-thread setting (DataParallel), one can use NCCL's simple broadcast and reduce; for tensors shared across multiple processes or even multiple machines, we need a special implementation, which is defined in the torch.distributed package.
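To make the distinction concrete, here is a minimal sketch (not taken from either repo; the tensor names are made up for illustration) contrasting the two communication styles:

```python
import torch
import torch.cuda.comm as comm
import torch.distributed as dist

# Single-process, multi-GPU (the DataParallel setting): sum per-GPU statistics
# onto device 0, then broadcast the result back to every device.
devices = list(range(torch.cuda.device_count()))
stats = [torch.randn(64, device=f'cuda:{d}') for d in devices]
total = comm.reduce_add(stats, destination=devices[0])
copies = comm.broadcast(total, devices)

# Multi-process / multi-machine (the DistributedDataParallel setting): each
# process holds one tensor; all_reduce sums it across all processes in place.
# Assumes dist.init_process_group(...) has already been called in every process.
local_stats = torch.randn(64, device='cuda')
dist.all_reduce(local_stats, op=dist.ReduceOp.SUM)
```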
@vacancy Thanks for the information.
I read the source code of DistributedDataParallel and found that it does not broadcast the parameters the way DataParallel does; it only all_reduces the gradients, so that all model replicas use the same gradients to optimize the same model, and all model replicas use the same buffers, which are broadcast from the model on device 0 of the rank-0 node.
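Roughly, stripped of bucketing and of overlapping communication with the backward pass, the gradient synchronization DistributedDataParallel performs is equivalent to something like the sketch below (assuming the process group is already initialized):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce every parameter's gradient and average it across processes."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```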
@zhanghang1989 I think your syncbn uses only comm.broadcast_coalesced and comm.reduce_add_coalesced, which only support cross-GPU communication within a single node; they do not support the cross-node (distributed) case.
Hi @acgtyrant
What exactly do you mean by "SynchronizedBatchNorm2d is not numerical stable only"?
My tests show that my implementation of SynchronizedBatchNorm2d is not as numerically stable as your bn; there is no other error. So, since your bn works, my bn should work too. But it does not, and I do not know why... So I need some help now.
After fixing the wrong view of the output, I am retraining DRN now, and it seems to work as expected.
@acgtyrant Is DistributedDataParallel working for you? Are you planning to send a Pull Request?
Thank you all for this beautiful work!
My distributed synced bn works with DistributedDataParallel, but because of confidentiality rules from my boss, I have deleted the post that contained the source of the distributed synced bn, sorry.
However, it is easy to implement: torch.distributed.all_reduce is synchronized automatically, so just wrap it as an autograd function, use it to all-reduce the sum and sum-of-squares, and then use that function in your synced bn module.
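A minimal sketch of that idea (class and variable names are mine, not from any particular repo): wrap torch.distributed.all_reduce in an autograd Function so the summed statistics stay differentiable, then feed the per-process sum and sum-of-squares through it.

```python
import torch
import torch.distributed as dist

class AllReduceSum(torch.autograd.Function):
    """Differentiable all-reduce (sum) across all processes."""

    @staticmethod
    def forward(ctx, tensor):
        tensor = tensor.clone()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor

    @staticmethod
    def backward(ctx, grad_output):
        # The output on every rank depends on the input of every rank,
        # so the incoming gradients also have to be summed across ranks.
        grad = grad_output.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad

# Inside the sync-bn forward pass (x is NCHW), roughly:
#   sum_x  = AllReduceSum.apply(x.sum(dim=[0, 2, 3]))
#   sum_x2 = AllReduceSum.apply((x * x).sum(dim=[0, 2, 3]))
#   count  = x.size(0) * x.size(2) * x.size(3) * dist.get_world_size()
#   mean   = sum_x / count
#   var    = sum_x2 / count - mean * mean
```

Note that the count above assumes every process sees the same per-GPU batch size; with unbalanced batches the count would need to be all-reduced as well.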
FYI, there is an open-source repo created by NVIDIA, https://github.com/NVIDIA/apex, which supports SyncBN with DistributedDataParallel.
In fact, as of now, it only supports SyncBN with DistributedDataParallel and does not support SyncBN with DataParallel; see this issue: https://github.com/NVIDIA/apex/issues/115
How about using torch.nn.parallel.DistributedDataParallel but only running on one node with 8 gpus? Does this repo work?
For torch.nn.parallel.DistributedDataParallel, please use torch.nn.SyncBatchNorm.
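For reference, a typical conversion looks like this (a sketch assuming one process per GPU launched via the usual env:// initialization; the torchvision ResNet is just a stand-in model):

```python
import torch
import torch.distributed as dist
import torchvision

dist.init_process_group(backend='nccl')
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torchvision.models.resnet50().cuda()
# Replace every BatchNorm layer with torch.nn.SyncBatchNorm before wrapping in DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```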
Currently not.
The implementation is designed for multi-GPU BatchNorm, which is commonly used for Computer Vision tasks. Thus, it uses NCCL for multi-GPU broadcasting and reduction.
A distributed version needs other types of synchronization primitives (e.g., shared memory or cross-machine synchronization).
Contributions will be highly appreciated!