lucidrains / x-transformers

A concise but complete full-attention transformer with a set of promising experimental features from various papers

Incorrect boolean mask in flash attention #160

Closed · stoprightthere closed this issue 1 year ago

stoprightthere commented 1 year ago

Hi,

Looks like flash_attn requires the mask to be True wherever a position should NOT be attended to. However, F.scaled_dot_product_attention expects the inverse when the mask is boolean: True means the position takes part in the attention.

Unless there is an attn_bias, no conversion is performed, so the mask passed to flash attention is effectively inverted.

tl;dr: flip the mask whenever it exists and attn_bias does not (and do so after applying the causal mask).
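
For illustration, here is a minimal sketch of the mismatch (illustrative tensor names, not the repository's actual code): a boolean mask built with the flash_attn-style convention (True = mask out) has to be inverted before it is handed to F.scaled_dot_product_attention, which treats True as "attend".

```python
import torch
import torch.nn.functional as F

b, h, n, d = 2, 4, 8, 16
q = torch.randn(b, h, n, d)
k = torch.randn(b, h, n, d)
v = torch.randn(b, h, n, d)

# hypothetical key-padding mask in the flash_attn-style convention: True = do NOT attend
pad_mask = torch.zeros(b, n, dtype=torch.bool)
pad_mask[:, -2:] = True  # mark the last two positions as padding

# passing pad_mask as-is would tell SDPA to attend ONLY to the padding positions,
# since a boolean attn_mask uses True = "take part in attention"; flipping it
# restores the intended meaning
attn_mask = ~pad_mask                    # True = attend
attn_mask = attn_mask[:, None, None, :]  # broadcast over heads and query positions

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```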

lucidrains commented 1 year ago

@stoprightthere shoot :man_facepalming:

could you see if https://github.com/lucidrains/x-transformers/commit/2a5bbdd83b74b61c944093b5d238b70aba196303 looks better?

lucidrains commented 1 year ago

@stoprightthere yea, i am keeping with the convention of True for attend, False for mask out. that should be the case for all my repos
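
For reference, a small sketch of what that convention looks like in practice (illustrative names, not the repository's code): a key-padding mask with True = attend, built from sequence lengths, can be passed straight to F.scaled_dot_product_attention, which uses the same boolean convention.

```python
import torch
import torch.nn.functional as F

b, h, n, d = 2, 4, 8, 16
lengths = torch.tensor([5, 8])  # actual lengths of two sequences padded to length n

mask = torch.arange(n)[None, :] < lengths[:, None]  # (b, n), True = real token -> attend
attn_mask = mask[:, None, None, :]                  # broadcast to (b, h, n_q, n_k)

q = torch.randn(b, h, n, d)
k = torch.randn(b, h, n, d)
v = torch.randn(b, h, n, d)

# no inversion needed: the repo's convention matches SDPA's boolean mask semantics
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```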

stoprightthere commented 1 year ago

Thanks for the quick reply!

I think it should then be flipped here and here, and possibly in other places :) because you flip the mask input to Attention here

lucidrains commented 1 year ago

@stoprightthere oh! i forgot i had to change the convention to integrate flash attention for this repository

one more review? :pray: https://github.com/lucidrains/x-transformers/commit/d731d82b99fbb3491b6a5f0d2012ea2f523411d8

lucidrains commented 1 year ago

https://github.com/lucidrains/x-transformers/commit/d2e0700aa0f3648c89885e319636244140d2d602 actually, decided to just redo everything to be consistent

stoprightthere commented 1 year ago

looks good!

thanks for the work on this repo and for quick replies!

lucidrains commented 1 year ago

@stoprightthere no problem! thanks for identifying this issue and saving a lot of phd students a lot of headache lol