idiap / fast-transformers

Pytorch library for fast transformer implementations

How is the causal mask constructed when training in batch mode with linear causal attention? #109

Open Howuhh opened 2 years ago

Howuhh commented 2 years ago

Hi! I have a few questions about the differences between the models.

I understand how the recurrent model is set up; that is described in the paper. But how is efficient training achieved in batch mode? As far as I understand, since we never explicitly compute the attention matrix, we can't simply apply a triangular mask. How does this work, then? Is it iterative as in the recurrent model, just implemented in CUDA? Or is it parallelizable as three matrix multiplications (like full attention)?
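
To make the question concrete, here is roughly how I imagine the batched computation could look, with prefix sums over the sequence dimension playing the role of the triangular mask. This is only a sketch using the elu(x) + 1 feature map from the paper, not the library's actual code:

```python
import torch

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Naive reference for causal linear attention in batch mode.

    Q, K, V: (batch, seq_len, heads, dim). Uses phi(x) = elu(x) + 1 as the
    feature map. This version materializes every per-step state, so it is
    memory-hungry; it is meant only to show how causality is enforced
    without ever building an attention matrix or a mask.
    """
    Q = torch.nn.functional.elu(Q) + 1
    K = torch.nn.functional.elu(K) + 1

    # Outer products phi(k_j) v_j^T for every position, then a cumulative
    # sum over time gives the running state S_i = sum_{j<=i} phi(k_j) v_j^T.
    KV = torch.einsum("nshd,nshm->nshdm", K, V)   # (N, S, H, D, M)
    S = KV.cumsum(dim=1)                          # prefix sums over seq_len

    # Normalizer z_i = phi(q_i)^T sum_{j<=i} phi(k_j)
    Z = torch.einsum("nshd,nshd->nsh", Q, K.cumsum(dim=1)) + eps

    # Output_i = phi(q_i)^T S_i / z_i
    return torch.einsum("nshd,nshdm->nshm", Q, S) / Z.unsqueeze(-1)
```

If this is roughly what the batch training path does, just fused into a CUDA kernel so the per-step states are never materialized, that would answer my question.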

Thanks!