lianqing11 opened this issue 4 years ago
We have no plan to support it at the moment, as this is the first time it has come up. We do have the primitives (all_gather, all_reduce, ...) needed to support it:
But there are ops in there that I have never heard of (batch_norm_gather_stats_with_counts, batch_norm_backward_reduce, ...) that might not be supported (meaning they would fall back to PyTorch/CPU instead of staying within the XLA world). There are also things like the line below, which should be avoided since they exit Lazy Tensor graph mode and trigger an execution:
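The statistics-merging step those ops perform can be sketched without any framework code. The following is a minimal pure-Python illustration (the replica batches and helper names are made up, not PyTorch/XLA API) of the count-weighted reduction that an all_reduce-based SyncBatchNorm performs: each replica contributes its local mean, biased variance, and element count, and the global statistics are recovered exactly.

```python
def local_stats(xs):
    """Per-replica biased mean/variance plus element count."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased, as batch norm uses
    return mean, var, n

def combine_stats(stats):
    """Merge (mean, var, count) triples into global statistics.

    This is the math behind gathering stats with counts: both E[x] and
    E[x^2] are weighted by per-replica counts before being re-centered.
    """
    total = sum(n for _, _, n in stats)
    g_mean = sum(m * n for m, _, n in stats) / total
    # Per replica, E[x^2] = var + mean^2.
    g_ex2 = sum((v + m * m) * n for m, v, n in stats) / total
    g_var = g_ex2 - g_mean * g_mean
    return g_mean, g_var, total

# Two "replicas" holding different local batches.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0]
g_mean, g_var, g_n = combine_stats([local_stats(a), local_stats(b)])

# Matches the stats computed directly over the concatenated global batch.
ref_mean, ref_var, ref_n = local_stats(a + b)
assert abs(g_mean - ref_mean) < 1e-9 and abs(g_var - ref_var) < 1e-9
```

In a real lowering, the per-replica triples would travel through all_reduce (or all_gather) so every replica ends up with the same global mean and variance before normalizing.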
Hi dlibenzi,
Thanks for your reply. I will try it.
Best regards, Qing LIAN
Have you had any success? Would you mind sharing the code? I also need sync batch norm.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I would be interested if this were included. SyncBN was useful to some Kaggle competitors in a recent competition, and I'd love to see support for it in PyTorch XLA.
@dlibenzi This issue was closed. Is this not something you plan to work on? SyncBN is quite important and fairly commonly used for multi-GPU training: it synchronizes the batch normalization statistics across the GPUs. Wouldn't something that synchronizes across multiple TPU cores be just as useful?
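The value of that synchronization is easy to demonstrate: with small per-device batches, local statistics drift from the global ones, so the same activation value normalizes differently on each device. A minimal pure-Python illustration (the replica batches below are invented for the example, not from any real workload):

```python
def normalize(x, batch, eps=1e-5):
    """Batch-norm-style normalization of x using the stats of `batch`."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return (x - mean) / (var + eps) ** 0.5

# Two replicas each see a different half of the global batch.
replica_a = [0.0, 1.0]
replica_b = [10.0, 11.0]
global_batch = replica_a + replica_b

# Without sync, 1.0 is normalized against replica_a's local stats only;
# with synced statistics it is normalized against the global batch.
local = normalize(1.0, replica_a)    # above the local mean -> positive
synced = normalize(1.0, global_batch)  # below the global mean -> negative
assert local > 0 and synced < 0
```

The two results even differ in sign here, which is why unsynced batch norm with tiny per-core batches can noticeably hurt training.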
I will reopen this issue and add it to our list of ops to lower.
Any update on the timeline for supporting SyncBatchNorm?
Hi,
I'm trying to use XLA to run experiments on the ImageNet dataset and hope to use SyncBN. I cannot find any resources about SyncBN in torch/xla. Will the XLA team support the SyncBN feature, or should I write it myself?
Best regards, Qing LIAN