Closed LinB203 closed 1 year ago
We used patch_size=16, embed_dim=384, depth=12, num_heads=6, decoder_embed_dim=256, decoder_depth=8, decoder_num_heads=8, mlp_ratio=4.
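For reference, those hyperparameters can be collected into a single config dict (a sketch; the dict and the sanity checks below are our own addition, though the key names follow the argument names used in the official facebookresearch/mae code):

```python
# Hypothetical ViT-S MAE configuration, assembled from the values in this thread.
vit_s_mae_config = dict(
    patch_size=16,
    embed_dim=384,
    depth=12,
    num_heads=6,
    decoder_embed_dim=256,
    decoder_depth=8,
    decoder_num_heads=8,
    mlp_ratio=4,
)

# Sanity checks: each attention head dimension must divide the embedding width.
assert vit_s_mae_config["embed_dim"] % vit_s_mae_config["num_heads"] == 0            # 384 / 6 = 64 per head
assert vit_s_mae_config["decoder_embed_dim"] % vit_s_mae_config["decoder_num_heads"] == 0  # 256 / 8 = 32 per head

# With MAE's default 224x224 input, the encoder sees (224 / 16)^2 patches.
img_size = 224
num_patches = (img_size // vit_s_mae_config["patch_size"]) ** 2
print(num_patches)  # -> 196
```

These keyword names match the constructor arguments of `MaskedAutoencoderViT` in the official repo, so the dict can be unpacked into that constructor when building the ViT-S variant.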
That's clear, thanks.
What about the pre-training and fine-tuning scripts? I guess the ViT-S result was produced with the original code, with only the training scripts changed. Am I right? Can I use your scripts to reproduce the ViT-S result?
@LinB203 Yes, you should use MAE's official code.
And the pre-training and fine-tuning scripts?
But there are no pre-training and fine-tuning scripts for ViT-S. Are the scripts the same as for ViT-B?
Yes, we just modified the architecture.
I noticed that you reproduced the MAE result with ViT-S using the official code, but the official code does not include a decoder design for ViT-S. I wonder how many decoder layers you used for ViT-S, and what values you chose for decoder_dim and num_heads?