Closed jameslahm closed 1 year ago
Hi @jameslahm ,
Thank you for your question. The ImageNet FLOPs reported in Tab. 5 follow Tab. 4 in FairNAS. The window size is 4 on ImageNet because the 224 input resolution becomes 4 after 64× downsampling. Hope this clarifies!
Thank you for your response!
Thank you for your great work! I found that the resolution in CascadedGroupAttention in the last stage of EfficientViT-M4 is 7 rather than 4, as shown in https://github.com/microsoft/Cream/blob/8dc38822b99fff8c262c585a32a4f09ac504d693/EfficientViT/downstream/efficientvit.py#L228. There is no clamp
window_resolution = min(window_resolution, resolution)
like the one in https://github.com/microsoft/Cream/blob/8dc38822b99fff8c262c585a32a4f09ac504d693/EfficientViT/classification/model/efficientvit.py#L208. However, the 299M FLOPs reported for EfficientViT-M4 in Table 5 is the same as the 299M FLOPs for EfficientViT-M4 in Table 2. I wonder whether the 299M figure in Table 5 is wrong. Thanks!
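To make the resolution arithmetic being discussed concrete, here is a minimal sketch. It assumes the classification model's layout of a 16× patch embedding followed by two 2× downsampling stages (64× total, with ceil division), and a nominal window size of 7 that the classification code clamps with the `min(...)` line above; the function name `stage_resolutions` is mine, not from the repo.

```python
import math

def stage_resolutions(img_size, patch_stride=16, num_stages=3):
    """Feature-map side length per stage, assuming a 16x patch embedding
    followed by 2x downsampling between stages (ceil division)."""
    res = img_size // patch_stride
    sizes = [res]
    for _ in range(num_stages - 1):
        res = math.ceil(res / 2)
        sizes.append(res)
    return sizes

# 224 input: 14 -> 7 -> 4, so the last stage sees a 4x4 feature map
# (224 / 64 = 3.5, rounded up to 4).
print(stage_resolutions(224))  # [14, 7, 4]

# The classification code clamps the attention window to the feature map,
# so a nominal window of 7 becomes 4 in the last stage:
window_resolution = 7
resolution = stage_resolutions(224)[-1]
window_resolution = min(window_resolution, resolution)
print(window_resolution)  # 4
```

Without that clamp, as in the downstream code linked above, the last stage would keep the nominal window of 7, which is the discrepancy the question raises.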