Closed LinB203 closed 1 year ago
We used patch_size=16, embed_dim=384, depth=12, num_heads=6, decoder_embed_dim=256, decoder_depth=8, decoder_num_heads=8, mlp_ratio=4.
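For reference, those hyperparameters can be collected into a single config dict (a sketch; the dict and the sanity checks below are our own addition, though the key names follow the argument names used in the official facebookresearch/mae code):

```python
# Hypothetical ViT-S MAE configuration, assembled from the values in this thread.
vit_s_mae_config = dict(
    patch_size=16,
    embed_dim=384,
    depth=12,
    num_heads=6,
    decoder_embed_dim=256,
    decoder_depth=8,
    decoder_num_heads=8,
    mlp_ratio=4,
)

# Sanity checks: each attention head dimension must divide the embedding width.
assert vit_s_mae_config["embed_dim"] % vit_s_mae_config["num_heads"] == 0            # 384 / 6 = 64 per head
assert vit_s_mae_config["decoder_embed_dim"] % vit_s_mae_config["decoder_num_heads"] == 0  # 256 / 8 = 32 per head

# With MAE's default 224x224 input, the encoder sees (224 / 16)^2 patches.
img_size = 224
num_patches = (img_size // vit_s_mae_config["patch_size"]) ** 2
print(num_patches)  # -> 196
```

These keyword names match the constructor arguments of `MaskedAutoencoderViT` in the official repo, so the dict can be unpacked into that constructor when building the ViT-S variant.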
That's clear, thanks.
What about the pre-training and fine-tuning scripts? I guess the ViT-S result was produced with the original code, with only the training scripts changed. Am I right? Can I use your scripts to reproduce the ViT-S result?
@LinB203 Yes, you should use MAE's official code.
And the pre-training and fine-tuning scripts?
But there are no pre-training and fine-tuning scripts for ViT-S. Are the scripts the same as for ViT-B?
Yes, we just modified the architecture.
I noticed that you reproduced the MAE result with ViT-S using the official code, but the official code does not include a decoder design for ViT-S. I wonder how many decoder layers you used for ViT-S, and what values you chose for decoder_dim and num_heads?