lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch
MIT License

Sliding window for transformer #61

Open MarcusLoppe opened 6 months ago

MarcusLoppe commented 6 months ago

Hi, I was wondering if it would be possible to implement a sliding-window decoder for the transformer? When increasing the max sequence length, the training time goes up dramatically, and I think a sliding-window decoder would greatly help with both training and inference speed.

I've tried using LocalAttention, but I'm not sure how to properly integrate it, since it takes q, k and v as inputs.
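For context, here is a rough sketch of how LocalAttention could be wrapped in a self-attention block that computes the q, k and v projections itself. This assumes the `local-attention` package; the module name `SlidingWindowSelfAttention` and the hyperparameters below are just illustrative, not anything from meshgpt-pytorch:

```python
import torch
from torch import nn
from local_attention import LocalAttention

class SlidingWindowSelfAttention(nn.Module):
    # hypothetical wrapper: project tokens to q/k/v, then apply causal local attention
    def __init__(self, dim, heads = 8, dim_head = 64, window_size = 512, dropout = 0.):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

        self.attn = LocalAttention(
            dim = dim_head,
            window_size = window_size,
            causal = True,          # decoder-style masking
            look_backward = 1,      # each window can also attend to the previous window
            look_forward = 0,
            dropout = dropout,
            autopad = True          # pad so the sequence length need not divide the window
        )

    def forward(self, x, mask = None):
        b, n, _ = x.shape
        h = self.heads

        # (b, n, dim) -> three (b, h, n, dim_head) tensors
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: t.reshape(b, n, h, -1).transpose(1, 2), (q, k, v))

        out = self.attn(q, k, v, mask = mask)   # (b, h, n, dim_head)

        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

# usage
# attn = SlidingWindowSelfAttention(dim = 512, window_size = 512)
# out = attn(torch.randn(1, 4096, 512))   # (1, 4096, 512)
```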

I know @lucidrains has already spent all their allotted time and more on this project, so if I could get some tips I could try to implement it myself.

lucidrains commented 6 months ago

@MarcusLoppe local attention is a good exercise to implement, moderate difficulty for a research engineer. getting kv cache working for bonus points..
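For anyone following along, one way the kv cache could work with a sliding window (just a sketch under my own assumptions, not this repo's actual code) is to keep a rolling cache that is trimmed to the attention span after every decoding step, so memory and per-step compute stay constant instead of growing with the sequence:

```python
import torch
import torch.nn.functional as F

def step_with_rolling_cache(q_new, k_new, v_new, cache, window_size):
    # q_new, k_new, v_new: (batch, heads, 1, dim_head) for the newly decoded token
    # cache: None, or a (k_cache, v_cache) tuple of shape (batch, heads, <=window_size, dim_head)
    if cache is None:
        k_cache, v_cache = k_new, v_new
    else:
        k_cache = torch.cat((cache[0], k_new), dim = -2)
        v_cache = torch.cat((cache[1], v_new), dim = -2)

    # drop keys/values outside the sliding window; the span kept here should match
    # whatever attention pattern was used during training
    # (e.g. roughly window_size * 2 for LocalAttention with look_backward = 1)
    k_cache = k_cache[..., -window_size:, :]
    v_cache = v_cache[..., -window_size:, :]

    # the single new query attends to everything left in the cache,
    # which is all in the past, so no extra causal mask is needed
    out = F.scaled_dot_product_attention(q_new, k_cache, v_cache)
    return out, (k_cache, v_cache)
```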