microsoft / Cream

This is a collection of our NAS and Vision Transformer work.

EfficientViT: wrong Flops for EfficientViT-M4 in Table 5? #183

Closed jameslahm closed 1 year ago

jameslahm commented 1 year ago

Thank you for your great work! I found that the resolution in `CascadedGroupAttention` in the last stage of EfficientViT-M4 is 7 rather than 4, as shown in https://github.com/microsoft/Cream/blob/8dc38822b99fff8c262c585a32a4f09ac504d693/EfficientViT/downstream/efficientvit.py#L228. There is no `window_resolution = min(window_resolution, resolution)` clamp like in https://github.com/microsoft/Cream/blob/8dc38822b99fff8c262c585a32a4f09ac504d693/EfficientViT/classification/model/efficientvit.py#L208. However, the 299M FLOPs reported for EfficientViT-M4 in Table 5 is the same as the 299M FLOPs for EfficientViT-M4 in Table 2. I wonder if the FLOPs for EfficientViT-M4 in Table 5 are wrong. Thanks!
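For context, a rough, purely illustrative sketch (not the repo's FLOPs counter) of why the window size matters for the quadratic attention term:

```python
# Illustrative only: in window attention, N = window_resolution**2 tokens attend
# to each other, so the pairwise-interaction count per window grows as N**2.
def attn_pairs_per_window(window_resolution: int) -> int:
    n_tokens = window_resolution ** 2
    return n_tokens ** 2  # both q·k and attn·v terms scale with this

print(attn_pairs_per_window(7))  # 2401 token pairs with window 7
print(attn_pairs_per_window(4))  # 256 token pairs with window 4
```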

xinyuliu-jeffrey commented 1 year ago

Hi @jameslahm,

Thank you for your question. The ImageNet FLOPs reported in Table 5 follow Table 4 in FairNAS. The window size is 4 on ImageNet because the 224 input resolution becomes 4 after 64× downsampling. Hope this clarifies!
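For concreteness, a quick sanity check of that arithmetic, assuming each spatial halving rounds up as in the classification model's subsample path (my reading of the code, so treat the exact schedule as an assumption):

```python
# Hypothetical walk-through: 224 -> 14 -> 7 -> 4, i.e. 64x total downsampling.
import math

res = 224
res //= 16            # 16x patch embedding: 224 -> 14
for _ in range(2):    # two subsample stages, each a further 2x
    res = math.ceil(res / 2)   # 14 -> 7 -> 4 (rounding up)
print(res)            # 4
```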

jameslahm commented 1 year ago

Thank you for your response!