Open luyvlei opened 2 years ago
@luyvlei ohh yes, i do believe you are correct, and this paper also came to a similar conclusion https://arxiv.org/abs/2101.07525
Do I need to do a test to verify this modification? If this modification is effective, I can submit a PR and test report. @lucidrains
@luyvlei so i think the issue is because the batchnorm statistics are already a moving average - i'll have to read the momentum squared paper above in detail and see if the conclusions are sound
as an aside, there are papers that are starting to use SimSiam (Kaiming's work, where the teacher is the same as the student but with a stop gradient) successfully, and it does not require exponential moving averages the way BYOL does. so i'm wondering how important these little details are, and whether it is even worth the time to debug
https://arxiv.org/abs/2111.00210 https://arxiv.org/abs/2110.05208
Hi all, I am also trying to reproduce BYOL results and am falling a bit short (~1%) and am wondering if this might be related to the reason why.
I figure there are two options during pretraining:
I believe #1 is correct based on my reading of the paper and looking through some implementations. If #1 is correct, no changes are needed. Also, since we feed the exact same images to the target and online networks, the running mean and running var they calculate should be the same in the end.
If #2 is correct, then we would have to copy the buffers as suggested above.
As an aside, I believe the issue in my repro is that I am following #1 and have SyncBatchNorm for the online network, but not for the target network.
@lucidrains @luyvlei
Yep, if the target model runs in train mode, the BN statistics don't matter, since they are never used. This implementation also uses train mode. But it's not clear that eval mode would yield any better results. @iseessel
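To illustrate the train-mode point, here is a minimal sketch assuming plain PyTorch `nn.BatchNorm1d` layers (not the BYOL networks from this repo). Two layers with deliberately different running statistics give the same output in train mode, because the batch statistics are used for normalization, and different outputs in eval mode. The names `bn_a`, `bn_b`, `train_same`, and `eval_same` are made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 3)

# Two BN layers with identical affine parameters (default weight=1, bias=0)
# but deliberately different running statistics.
bn_a = nn.BatchNorm1d(3)
bn_b = nn.BatchNorm1d(3)
with torch.no_grad():
    bn_b.running_mean.fill_(10.0)
    bn_b.running_var.fill_(4.0)

# Train mode: normalization uses the statistics of the current batch,
# so the running buffers do not influence the output.
bn_a.train()
bn_b.train()
train_same = torch.allclose(bn_a(x), bn_b(x))

# Eval mode: normalization uses the running buffers, so the outputs differ.
bn_a.eval()
bn_b.eval()
eval_same = torch.allclose(bn_a(x), bn_b(x))

print(train_same, eval_same)
```

So the buffers of the target network only start to matter if the target is switched to eval mode.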
The following function applies a moving average to the EMA model. But it doesn't update the statistics (running_mean and running_var), since these two are buffers, not parameters.
Should I use this function instead?
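For reference, a minimal sketch of the parameters-vs-buffers distinction described above, assuming plain PyTorch. `ema_update` and its `include_buffers` flag are hypothetical names for illustration, not the function from this repo, and copying the buffers verbatim is just one possible choice (an EMA over the buffers would be another).

```python
import copy
import torch
from torch import nn

def ema_update(ema_model, online_model, beta=0.99, include_buffers=False):
    """Hypothetical helper: EMA over parameters, optional copy of buffers."""
    with torch.no_grad():
        # parameters() yields weights and biases only, not BN running stats
        for ema_p, p in zip(ema_model.parameters(), online_model.parameters()):
            ema_p.mul_(beta).add_(p, alpha=1 - beta)
        if include_buffers:
            # buffers() yields running_mean, running_var, num_batches_tracked
            for ema_b, b in zip(ema_model.buffers(), online_model.buffers()):
                ema_b.copy_(b)

online = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
target = copy.deepcopy(online)

param_names = [name for name, _ in online.named_parameters()]
buffer_names = [name for name, _ in online.named_buffers()]

# One forward pass in train mode updates the online running statistics.
online.train()
online(torch.randn(16, 4) * 3 + 5)

# Parameter-only EMA: the target's running_mean keeps its initial zeros.
ema_update(target, online)
stale = torch.equal(target[1].running_mean, torch.zeros(4))

# With buffers included, the target's statistics match the online network.
ema_update(target, online, include_buffers=True)
synced = torch.equal(target[1].running_mean, online[1].running_mean)

print(param_names, buffer_names, stale, synced)
```

Whether syncing the buffers actually helps depends on whether the target network is ever run in eval mode, as discussed in the replies.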