Support SyncBN in Pytorch/xla

pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)

https://pytorch.org/xla


Support SyncBN in Pytorch/xla #2223

Open lianqing11 opened 4 years ago

lianqing11 commented 4 years ago

Hi,

I'm trying to use XLA to run experiments on the ImageNet dataset and would like to use SyncBN. I cannot find any resources about SyncBN in torch/xla. Will the XLA team support the SyncBN feature, or should I write it myself?

Best regards, Qing LIAN

dlibenzi commented 4 years ago

We have no plan to support it ATM, as this is the first time it came up. We have the primitives (all_gather, all_reduce, ...) to support it:

https://github.com/pytorch/pytorch/blob/541814f2b7eacabacdc87ccb1b4495bf486f501a/torch/nn/modules/_functions.py#L6

But there are ops in there that I have never heard of (batch_norm_gather_stats_with_counts, batch_norm_backward_reduce, ...) that might not be supported (meaning, they will go to pytorch/CPU instead of staying within the XLA world). Also there are things like the line below, which should be avoided since they exit Lazy Tensor graph mode and trigger an execution:

https://github.com/pytorch/pytorch/blob/541814f2b7eacabacdc87ccb1b4495bf486f501a/torch/nn/modules/_functions.py#L33
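For illustration only, here is a minimal sketch of the idea behind building SyncBN on an all-reduce primitive. It is plain Python rather than torch_xla code: each replica computes a count, a sum, and a sum of squares for one channel, the three values are summed across replicas, and every replica then normalizes with the same global mean and variance. The names `local_stats`, `all_reduce_sum`, and `sync_batch_norm` are hypothetical, and `all_reduce_sum` is only a stand-in for a real cross-replica sum (in torch_xla that role would presumably be played by something like `xm.all_reduce(xm.REDUCE_SUM, ...)`). Note that the count is reduced together with the sums instead of being pulled to the host.

```python
from typing import List, Tuple

Stats = Tuple[float, float, float]


def local_stats(values: List[float]) -> Stats:
    """Per-replica partial sums for one channel: (count, sum, sum of squares)."""
    return (float(len(values)), sum(values), sum(v * v for v in values))


def all_reduce_sum(per_replica: List[Stats]) -> List[Stats]:
    """Stand-in for a cross-replica sum: every replica receives the same totals."""
    total = tuple(sum(col) for col in zip(*per_replica))
    return [total for _ in per_replica]


def sync_batch_norm(shards: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize each replica's shard using statistics of the whole global batch."""
    reduced = all_reduce_sum([local_stats(s) for s in shards])
    out = []
    for shard, (n, s, sq) in zip(shards, reduced):
        mean = s / n
        var = sq / n - mean * mean  # biased variance over the global batch
        inv_std = (var + eps) ** -0.5
        out.append([(v - mean) * inv_std for v in shard])
    return out


# Two "replicas" holding different numbers of samples for the same channel.
shards = [[1.0, 2.0], [3.0, 4.0, 5.0, 6.0]]
normalized = sync_batch_norm(shards)
```

The affine scale and shift, the running statistics, and the backward pass are left out; a real implementation would also need to reduce the gradient statistics across replicas.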

lianqing11 commented 4 years ago

Hi dlibenzi,

Thanks for your reply. I will try it.

Best regards, Qing LIAN

matt-peters commented 4 years ago

Have you had any success? Would you mind sharing the code? I also need sync batch norm.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tmabraham commented 4 years ago

I would be interested if this is included. SyncBN was useful to some Kaggle competitors in a recent competition and it sounds like it's really useful. Would love to see support for it in PyTorch XLA.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tmabraham commented 4 years ago

@dlibenzi This issue was closed. Is this not something you plan to work on? SyncBN is something quite important and somewhat commonly used for multi-GPU training. It synchronizes the batch normalization statistics across the multiple GPUs. Won't something that synchronizes across multiple TPU cores also be useful?

JackCaoG commented 2 years ago

I will reopen this issue and add it to our list of ops to lower.

leonid-pishchulin commented 1 year ago

Any update on the timeline for supporting SyncBatchNorm?