lucidrains / linear-attention-transformer

Transformer based on a variant of attention that is linear complexity in respect to sequence length
MIT License

Does the causal attention really work here? #16

Open charlesxu90 opened 2 years ago

charlesxu90 commented 2 years ago

I found that efficient attention doesn't work in the causal attention scenario, as mentioned here: https://github.com/cmsflash/efficient-attention/issues/4

So I doubt whether the causal attention in this code really works.

lucidrains commented 2 years ago

@charlesxu90 yea it works

charlesxu90 commented 2 years ago

@lucidrains Thanks for answering. Really appreciate it!

Causal self-attention requires a triangular attention mask to mask out future tokens. In this code, I did find the interface you left for input_mask.

However, I didn't find the place where you initialize the attention mask. That's what confuses me.

Thanks.
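
For context, a minimal sketch of how causality can be enforced in linear attention without an explicit triangular mask: because the attention is factored as feature maps over queries and keys, position i can be restricted to positions <= i by taking cumulative (prefix) sums of the key/value terms along the sequence dimension. This is not the library's exact implementation; the function name, shapes, and the softmax/exp feature maps below are illustrative assumptions.

```python
import torch

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, heads, seq_len, dim). Illustrative sketch only.
    q = q.softmax(dim=-1)   # positive feature map on queries (assumed choice)
    k = torch.exp(k)        # positive feature map on keys (assumed choice)

    # Prefix sums over the sequence dimension replace the triangular mask:
    # position i only aggregates keys/values from positions <= i.
    kv = torch.einsum('bhnd,bhne->bhnde', k, v).cumsum(dim=2)  # running sum of k^T v
    k_sum = k.cumsum(dim=2)                                     # running sum of k

    num = torch.einsum('bhnd,bhnde->bhne', q, kv)               # numerator per position
    den = torch.einsum('bhnd,bhnd->bhn', q, k_sum).clamp(min=eps)
    return num / den.unsqueeze(-1)

# usage
b, h, n, d = 2, 4, 128, 32
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
out = causal_linear_attention(q, k, v)  # (2, 4, 128, 32)
```

The cumulative-sum formulation is why no explicit triangular mask appears in causal linear attention code; the `input_mask` interface is separate and handles padding, not causality.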