lucidrains / local-attention

An implementation of local windowed attention for language modeling
MIT License
370 stars · 40 forks

What exactly is the attention pattern? #11

Closed beleen23 closed 1 year ago

beleen23 commented 2 years ago

I'm struggling to understand what the attention pattern looks like. I understand it works in blocks, and you can choose how many blocks forward and backward you want to attend to. However, it is not clear to me how the shift between blocks is done. Is the pattern as in the picture below (for a window size of 3, look_forward=1, look_backward=1)?

[image: IMG_0069]

Or is the shift just one token?

Thank you! :)

guoweiyu commented 1 year ago

I have the same question

lucidrains commented 1 year ago

yup, look_forward and look_backward are in multiples of the window size

your diagram is correct!
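
For concreteness, here is a minimal sketch (not library code) of the block-local pattern as described above, assuming each block is window_size tokens wide and queries in block i may attend to keys in blocks i - look_backward through i + look_forward:

```python
import torch

def block_local_mask(seq_len, window_size, look_backward=1, look_forward=1):
    # block index of every query / key position
    block = torch.arange(seq_len) // window_size  # (seq_len,)
    q_block = block[:, None]                      # (seq_len, 1)
    k_block = block[None, :]                      # (1, seq_len)
    diff = k_block - q_block
    # True where attention is allowed
    return (diff >= -look_backward) & (diff <= look_forward)

# window size 3, one block back and one block forward, as in the diagram above
print(block_local_mask(9, window_size=3).int())
```

The printed 9x9 boolean matrix reproduces the blockwise pattern in the picture: the shift happens a whole block at a time, not one token at a time.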

gordicaleksa commented 4 months ago

The README diagram confused me as well, so for others & future me: for look_forward=0, look_backward=1, the diagram looks exactly the same, except you dissect it along the line of symmetry (the diagonal) and remove everything above it.

EDIT: actually, with exact_windowsize set to True, we get the pattern shown in the README, not the chopped one.
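
A minimal usage sketch along these lines, loosely following the README example (the argument names and comments are my assumptions about the current API and may differ across versions, so check the repo):

```python
import torch
from local_attention import LocalAttention

attn = LocalAttention(
    dim = 64,                 # per-head dimension
    window_size = 3,
    causal = True,
    look_backward = 1,
    look_forward = 0,         # no look-ahead in the causal setting
    exact_windowsize = True,  # cap each query at window_size visible keys
)

q = torch.randn(2, 8, 9, 64)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 9, 64)
v = torch.randn(2, 8, 9, 64)

out = attn(q, k, v)           # (2, 8, 9, 64)
```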