luissen / ESRT

MIT License

what is the difference between Feature Split (FS) in EMHA and window-attention? #7

Open rami0205 opened 1 year ago

rami0205 commented 1 year ago

Thank you for your work.

After reading your paper, I have a question.

Regarding the Feature Split (FS) in Sec. 3.2.2 (Efficient Transformer), I am confused about the difference between FS and the window attention of Swin Transformer.

Your FS splits the features into N/s × N/s groups, while window attention (in Swin Transformer) splits the features into N/M × N/M windows, where M is the window size.

Self-attention is then computed within each N/s × N/s group (for FS) and within each N/M × N/M window (for window partitioning), respectively.

The values of s (in FS) and M (in window attention) can of course differ, but beyond that I don't understand the mechanistic difference between the two.
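To make my question concrete, here is how I currently picture the two mechanisms (a minimal numpy sketch: single head, no Q/K/V projections, no shifted windows, no relative position bias — purely illustrative, not your actual implementation). The only structural difference I can see is the partition rule: FS cuts along the flattened token axis, while window partitioning cuts the 2D grid.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def feature_split_attention(x, s):
    """FS as I understand it: split the N tokens into s groups along the
    token axis and run self-attention inside each group, so each
    attention map is (N/s) x (N/s)."""
    N, C = x.shape
    groups = x.reshape(s, N // s, C)
    out = np.empty_like(groups)
    for i, g in enumerate(groups):
        attn = softmax(g @ g.T / np.sqrt(C))  # (N/s, N/s)
        out[i] = attn @ g
    return out.reshape(N, C)

def window_attention(x, H, W, M):
    """Swin-style window attention: partition the H x W token grid into
    (H/M)*(W/M) non-overlapping M x M windows and run self-attention
    inside each window, so each attention map is (M*M) x (M*M)."""
    C = x.shape[-1]
    # (H*W, C) -> (H/M, W/M, M, M, C) -> (num_windows, M*M, C)
    grid = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    windows = grid.reshape(-1, M * M, C)
    out = np.empty_like(windows)
    for i, w in enumerate(windows):
        attn = softmax(w @ w.T / np.sqrt(C))  # (M*M, M*M)
        out[i] = attn @ w
    # undo the window partition
    out = out.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H * W, C)
```

In this picture, FS with s = 1 and window attention with a single window covering the whole H × W grid both reduce to ordinary full self-attention, which is why the two look like the same idea to me apart from how the tokens are grouped.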

Once more, thank you for your hard work.