microsoft / Focal-Transformer

[NeurIPS 2021 Spotlight] Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"
MIT License

Relationship between focal window size and focal region size #5

Closed liyiersan closed 2 years ago

liyiersan commented 2 years ago

I was confused by the relationship between focal window size and focal region size. Could you explain it more clearly? Take stage 1, level 0 as an example: sw = 1 and sr = 13, so sw*sr does not evenly divide the output size 56. I cannot understand why sr is 13. Thanks a lot if you could help me.

jwyang commented 2 years ago

Hi, @liyiersan

Focal window size means the granularity at which window pooling is performed on the feature map, while focal region size means the size of the region that a query in a local window attends to. For example, at level 0 we use the most fine-grained tokens and thus do not use any window pooling, so sw = 1, but the focal region becomes 7 + 2*3 = 13, where 7 is the size of each window and 3 is the extension on each side, so that the window tokens can attend to their surroundings outside the local window. Hope this clarifies the idea of focal attention, but please let me know if it is still confusing. Thanks!
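A minimal sketch of the arithmetic above (not code from this repo): the attended region grows by the per-side expansion on both sides of the window, which is also why sw*sr need not divide the feature-map size — neighboring windows' expanded regions overlap.

```python
def focal_region_size(window_size: int, expansion: int) -> int:
    """Size of the region a query window attends to at level 0.

    Each query window of size `window_size` also attends to `expansion`
    extra tokens on every side, so the region grows by 2 * expansion
    along each axis.
    """
    return window_size + 2 * expansion


# Stage 1, level 0: 7x7 windows expanded by 3 tokens per side.
print(focal_region_size(7, 3))  # 13
```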

liyiersan commented 2 years ago

At the fine-grained level, torch.roll is used to expand the key and value set; at the coarse-grained levels, torch.unfold is used. I was confused about why torch.roll and torch.unfold are needed; see issue #6 for details. Thanks very much. Sorry for my poor English.
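A hypothetical illustration (not this repo's implementation) of why a plain reshape is not enough: reshaping partitions the feature map into disjoint windows, whereas `torch.nn.functional.unfold` with a stride smaller than the kernel extracts overlapping patches, letting each query window gather keys and values from beyond its own boundary.

```python
import torch
import torch.nn.functional as F

# Toy (B, C, H, W) feature map.
x = torch.arange(16.0).reshape(1, 1, 4, 4)

# Disjoint 2x2 windows (stride == kernel): 4 windows, no shared tokens.
disjoint = F.unfold(x, kernel_size=2, stride=2)     # shape (1, 4, 4)

# Overlapping 2x2 windows (stride 1): 9 windows that share tokens,
# so keys/values from outside a window's own area can be attended to.
overlapping = F.unfold(x, kernel_size=2, stride=1)  # shape (1, 4, 9)

print(disjoint.shape, overlapping.shape)
```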

jwyang commented 2 years ago

I believe this issue has been addressed.