rayleizhu / BiFormer

[CVPR 2023] Official code release of our paper "BiFormer: Vision Transformer with Bi-Level Routing Attention"
https://arxiv.org/abs/2303.08810
MIT License
460 stars 36 forks

one 7x7 conv vs. two 3x3 conv #4

Closed LMMMEng closed 1 year ago

LMMMEng commented 1 year ago

Thank you for your wonderful work!

It has been noted that some recent works use two 3x3 convs (stride=2 each) instead of one 7x7 conv (stride=4) as the stem. Is this because the former leads to better results?

rayleizhu commented 1 year ago

It seems that two 3x3 convs work better, judging by UniFormer's choice:

https://github.com/Sense-X/UniFormer/blob/849cd0cd3b163f84102b1a799019a689d8d3fb8a/image_classification/models/uniformer.py#L341

But they do not report a strict ablation study on this (i.e., one that changes only the patch-embedding stem). I did not try it either; I simply followed their routine and focused on the attention part.
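For intuition, both stem options produce the same 4x downsampling and even the same receptive field; the practical difference is the extra normalization/nonlinearity between the two 3x3 convs. A quick sketch (the helper below is my own illustration, not code from the BiFormer or UniFormer repos):

```python
# Hypothetical helper: compare stacked-conv stems by total stride
# and receptive field. layers = list of (kernel_size, stride).
def stem_stats(layers):
    total_stride = 1
    receptive_field = 1
    for k, s in layers:
        # each conv adds (k - 1) * (product of preceding strides)
        # pixels to the receptive field
        receptive_field += (k - 1) * total_stride
        total_stride *= s
    return total_stride, receptive_field

print(stem_stats([(7, 4)]))          # one 7x7 conv, stride 4 -> (4, 7)
print(stem_stats([(3, 2), (3, 2)]))  # two 3x3 convs, stride 2 -> (4, 7)
```

So the two stems are equivalent in stride and receptive field; any accuracy gap would come from the intermediate activation and the different parameter count, which is presumably why a dedicated ablation would be needed to settle it.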


LMMMEng commented 1 year ago

Got it, thank you!