pengzhiliang / MAE-pytorch

Unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners

Low Linear-Prob accuracy #65

Open launchauto opened 2 years ago

launchauto commented 2 years ago

Dear author, I have reproduced your code using 64 V100 GPUs. Every setting is the same as in the paper (batch size 4096). The end-to-end fine-tuning accuracy is almost the same as in the paper; however, the linear-probing accuracy is lower than expected. All of the experiments use normalized targets.

| Architecture  | Epochs | End-to-end fine-tuning     | Linear probing           |
|---------------|--------|----------------------------|--------------------------|
| MAE-ViT-Base  | 1600   | top-1 83.186, top-5 93.486 | top-1 53.64, top-5 77.32 |
| MAE-ViT-Large | 800    | top-1 85.320, top-5 97.296 | top-1 67.45, top-5 87.13 |

**According to the paper, MAE-ViT-Large linear-probing top-1 should be 73.9.**

By the way, I replaced the 1D sin-cos position embedding with the 2D sin-cos position embedding from MoCo v3, which may help (MAE ViT-Base: +0.3% in both end-to-end fine-tuning and linear probing).
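
For reference, here is a minimal sketch of a MoCo v3-style 2D sin-cos position embedding; the function name and defaults are illustrative, not the exact repo code:

```python
import torch


def build_2d_sincos_pos_embed(grid_size, embed_dim, temperature=10000.0):
    """Build a fixed 2D sin-cos position embedding of shape (1, grid_size**2, embed_dim).

    Half of the channels encode the height coordinate and half the width
    coordinate, each with sin and cos terms (MoCo v3 / MAE style).
    """
    assert embed_dim % 4 == 0, "embed_dim must be divisible by 4 for 2D sin-cos"
    grid_h = torch.arange(grid_size, dtype=torch.float32)
    grid_w = torch.arange(grid_size, dtype=torch.float32)
    grid_w, grid_h = torch.meshgrid(grid_w, grid_h, indexing="ij")

    pos_dim = embed_dim // 4
    omega = 1.0 / (temperature ** (torch.arange(pos_dim, dtype=torch.float32) / pos_dim))
    out_w = grid_w.flatten()[:, None] * omega[None, :]   # (L, pos_dim)
    out_h = grid_h.flatten()[:, None] * omega[None, :]   # (L, pos_dim)

    pos_embed = torch.cat(
        [torch.sin(out_w), torch.cos(out_w), torch.sin(out_h), torch.cos(out_h)], dim=1
    )                                                     # (L, embed_dim)
    return pos_embed.unsqueeze(0)                         # (1, L, embed_dim)


# Example: ViT-Base with a 14x14 patch grid (224 / 16) and embed_dim=768.
pos_embed = build_2d_sincos_pos_embed(grid_size=14, embed_dim=768)
print(pos_embed.shape)  # torch.Size([1, 196, 768])
```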

I also tested your released 400-epoch MAE-ViT-Base model; its linear-probing top-1 accuracy is 50.91.

Did I miss any details mentioned in the paper?

For the linear-probing hyperparameters, I followed the settings in the appendix of the paper (a sketch of this setup follows the list below):

- optimizer: LARS, lr=6.4, batch size=16384, weight decay=0, momentum=0.9
- schedule: cosine decay, warmup epochs=10, total training epochs=90
- data augmentation: only random resized crop
- the last LayerNorm is replaced with a BatchNorm (affine=False) before the classifier
- during linear probing the backbone is frozen; only the fc + norm + mean pooling in the classifier head are updated
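
A minimal sketch of this probe-head setup in PyTorch; the `DummyEncoder`, the SGD stand-in for LARS, and the tensor shapes are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000


class DummyEncoder(nn.Module):
    """Stand-in for the pretrained MAE ViT-Base encoder; returns (B, N, embed_dim) tokens.

    In practice, load the real backbone from the repo's checkpoint instead.
    """

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        b = x.shape[0]
        tokens = torch.randn(b, 196, embed_dim, device=x.device)  # fake patch tokens
        return self.proj(tokens)


backbone = DummyEncoder()

# Freeze the encoder: only the probe head below is trained.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# Probe head: BatchNorm without affine parameters, then a linear classifier.
head = nn.Sequential(
    nn.BatchNorm1d(embed_dim, affine=False, eps=1e-6),
    nn.Linear(embed_dim, num_classes),
)


def forward_linear_probe(images):
    with torch.no_grad():
        tokens = backbone(images)     # (B, N, embed_dim) patch tokens
        feats = tokens.mean(dim=1)    # global average pooling over patches
    return head(feats)


# LARS is not in torch.optim; the paper setting (lr=6.4, wd=0, momentum=0.9)
# needs an external LARS implementation. SGD is used here only as a stand-in.
optimizer = torch.optim.SGD(head.parameters(), lr=6.4, momentum=0.9, weight_decay=0.0)

logits = forward_linear_probe(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 1000])
```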

michuanhaohao commented 2 years ago

I got similar results. MAE + ViT-B + 400 epochs: the linear-probing top-1 accuracy is 53.01.

- optimizer: AdamW, lr=0.016, batch size=4096, weight decay=0
- schedule: cosine decay, warmup epochs=5, total training epochs=100
- regularization: mixup=0.0, cutmix=0.0, label smoothing=0.0
- data augmentation: only random resized crop and random flip

leeyegy commented 2 years ago

Thanks for sharing. According to your reproduction, the 1600-epoch pretrained ViT-B reaches only 83.2 end-to-end fine-tuning accuracy, a 0.4 gap compared to the paper. However, the 400-epoch pretrained ViT-B already achieves 83.1 end-to-end fine-tuning accuracy. It seems that the extra 1200 epochs of pretraining bring negligible improvement, which is quite confusing. Do you have any idea why?

launchauto commented 2 years ago

> negligible

Sorry, no idea.

launchauto commented 2 years ago

> I got similar results. MAE + ViT-B + 400 epochs: the linear-probing top-1 accuracy is 53.01.
>
> optimizer AdamW, lr=0.016, batch size=4096, weight decay=0, cosine decay, mixup=0.0, cutmix=0.0, label smoothing=0.0, warmup epochs=5, total training epochs=100; only random resized crop and random flip as data augmentation

Yeah, I used your linear-probing recipe and got about +0.33% when testing the MAE-Large model. However, it is still much lower than expected.

mts42000 commented 2 years ago

I also tried to reproduce the linear-probe results with no success. Interestingly, when I used the non-normalized loss during pretraining, the linear-probe accuracy for the base config increased to 60% (still much lower than the expected 68%). With the normalized loss I also got 53.9% accuracy, like you. Were you able to reproduce the linear-probe results lately?
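
For clarity, "normalized loss" here means regressing per-patch normalized pixel targets, as described in the MAE paper. A minimal sketch of the two target variants; helper names and shapes are illustrative:

```python
import torch


def patchify(imgs, patch_size=16):
    """(B, 3, H, W) -> (B, num_patches, patch_size**2 * 3)."""
    b, c, h, w = imgs.shape
    ph, pw = h // patch_size, w // patch_size
    x = imgs.reshape(b, c, ph, patch_size, pw, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, ph * pw, patch_size * patch_size * c)
    return x


def mae_targets(imgs, normalize=True, patch_size=16):
    """Reconstruction targets: raw pixel patches, or per-patch normalized patches."""
    target = patchify(imgs, patch_size)
    if normalize:
        # Normalize each patch by its own mean and variance (the "normalized targets" case).
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    return target


imgs = torch.randn(2, 3, 224, 224)
print(mae_targets(imgs, normalize=True).shape)   # torch.Size([2, 196, 768])
```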

ShoufaChen commented 2 years ago

Hi, @launchauto, @michuanhaohao , @mts42000

Thanks for your efforts in reproducing the linear probe results.

I noticed that the official MAE repo has released the linear-probing code, so it should now not be hard to reproduce.

However, I was wondering whether you found what caused the inconsistent performance compared with your original reproduction. There does not seem to be much difference between your configuration and the official one, yet the performance gap is very large.

Any help would be appreciated.