hustvl / Vim

[ICML 2024] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Apache License 2.0

Vim configurations have double the number of transformer blocks as compared to timm's ViT/DeiT configurations #23

Open SarthakYadav opened 4 months ago

SarthakYadav commented 4 months ago

Thanks for the useful repo. While going through the code, I noticed that the Vim-T and Vim-S configurations have twice as many blocks (depth=24) as the Tiny and Small ViT/DeiT configurations in timm (depth=12). Is there a reason for this disparity?

CiaoHe commented 4 months ago

Since the original Mamba paper says that one Transformer layer ~ 2 Mamba blocks:
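A rough parameter count illustrates this rule of thumb. The sketch below is not taken from the Vim code base. It counts only the large linear projections and ignores biases, norms, the convolution, and the SSM-specific parameters. The function names are hypothetical. The expansion factor of 2 and MLP ratio of 4 are the usual defaults, and the embedding width of 192 for the Tiny configurations is an assumption.

```python
def transformer_layer_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one Transformer layer (attention + MLP)."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # two linear layers of the MLP
    return attention + mlp


def mamba_block_params(d_model: int, expand: int = 2) -> int:
    """Approximate weight count of one Mamba block (linear projections only)."""
    in_proj = 2 * expand * d_model * d_model  # input projection for both branches
    out_proj = expand * d_model * d_model     # output projection back to d_model
    return in_proj + out_proj


d_model = 192  # assumed embedding width of the Tiny configurations

vit_total = 12 * transformer_layer_params(d_model)  # depth=12, as in timm ViT/DeiT
vim_total = 24 * mamba_block_params(d_model)        # depth=24, as in Vim-T

print(f"12 Transformer layers: {vit_total:,} weights")
print(f"24 Mamba blocks:       {vim_total:,} weights")
```

Under these simplifications one Transformer layer has about 12·D² weights and one Mamba block about 6·D², so doubling the depth keeps the two backbones at a comparable size.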