In this comment the question is forwarded to an issue in the paper's original repository. There it is explained that the purpose of this operation is to obtain the same distribution of labeled and unlabeled examples on every GPU. Without interleaving, batch norm would operate on different distributions on each device, leading to inconsistent moments.
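For reference, here is a minimal sketch of what the interleave / de_interleave helpers typically look like in FixMatch-style PyTorch implementations (names and the exact `size` argument may differ from the repo in question): the concatenated labeled+unlabeled batch is reordered so that, if it were later split into chunks across GPUs, every chunk would contain the same mix of labeled and unlabeled samples, and thus every replica's batch norm would see the same input distribution.

```python
import torch

def interleave(x, size):
    # Reorder a stacked batch [labeled; unlabeled] so that labeled samples are
    # spread evenly along the batch dimension instead of grouped at the front.
    s = list(x.shape)
    return x.reshape([-1, size] + s[1:]).transpose(0, 1).reshape([-1] + s[1:])

def de_interleave(x, size):
    # Inverse of interleave: restore the original [labeled; unlabeled] ordering.
    s = list(x.shape)
    return x.reshape([size, -1] + s[1:]).transpose(0, 1).reshape([-1] + s[1:])

# Hypothetical usage with FixMatch-style batch sizes (batch_size labeled images,
# 2 * mu * batch_size unlabeled images from weak + strong augmentation):
batch_size, mu = 4, 7
inputs_x = torch.randn(batch_size, 3, 32, 32)
inputs_u = torch.randn(2 * mu * batch_size, 3, 32, 32)

inputs = interleave(torch.cat((inputs_x, inputs_u)), 2 * mu + 1)
# logits = model(inputs)                     # single forward pass
# logits = de_interleave(logits, 2 * mu + 1)
# logits_x = logits[:batch_size]             # recover the labeled logits
```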
Thanks
I can see the point of using interleave when performing multi-GPU training. But here, since no DataParallel is involved, the input would not be scattered to different GPUs in the forward pass. As for DistributedDataParallel, the batch is already allocated per process by the DistributedSampler. So interleaving labeled+unlabeled on a single GPU seems redundant?
Just wanted to know the intuition behind the interleave and deinterleave operations. How does this help?