Hi! I have a few questions about the differences between the models.
I understand how the recursive model is set up; it is described in the publication. But how is the model trained efficiently in batch fashion? As far as I understand, since we never explicitly compute the attention matrix, we can't just apply a triangular mask. How does this work, then? Is it still iterative, as in the recursive model, just implemented in CUDA? Or is it easily parallelizable as three matrix multiplications, like full attention?
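To make my question concrete, here is a minimal sketch of the two formulations I have in mind. I'm assuming a generic linear-attention-style setup with a feature map `phi` (I use `elu + 1` here just as a placeholder; the actual feature map in your model may differ). The recurrent form never builds the T x T matrix, while the naive parallel form needs the explicit triangular mask that I understand the batch implementation avoids:

```python
import torch
import torch.nn.functional as F


def phi(x):
    # Placeholder feature map (elu + 1); an assumption on my part,
    # not necessarily the one used in the publication.
    return F.elu(x) + 1.0


def recurrent_form(q, k, v):
    """Recursive model as I understand it: one sequential step per
    position, carrying a running state instead of a T x T matrix."""
    q, k = phi(q), phi(k)
    T, d = q.shape
    S = torch.zeros(d, v.shape[-1])   # running sum of outer(k_t, v_t)
    z = torch.zeros(d)                # running sum of k_t (normalizer)
    out = []
    for t in range(T):
        S = S + torch.outer(k[t], v[t])
        z = z + k[t]
        out.append((q[t] @ S) / (q[t] @ z + 1e-6))
    return torch.stack(out)


def naive_parallel_form(q, k, v):
    """Fully parallel form: three matmul-like ops, but it materializes
    the T x T matrix so that the triangular mask can be applied --
    exactly what the batch model supposedly never does."""
    q, k = phi(q), phi(k)
    A = (q @ k.T).tril()              # explicit causal mask
    return (A @ v) / (A.sum(-1, keepdim=True) + 1e-6)
```

Both compute the same output (up to numerics), so my question is really whether the batch/CUDA implementation is just the first form fused into a kernel, some chunked middle ground between the two, or something else entirely.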
Thanks!