lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch
MIT License

Sliding window for transformer #61

Open MarcusLoppe opened 6 months ago

MarcusLoppe commented 6 months ago

Hi, I was wondering if it would be possible to implement a sliding-window decoder for the transformer? When increasing the max sequence length, the training time goes up dramatically, and I think a sliding-window decoder would greatly help with both training and inference speed.

I've tried using LocalAttention, but I'm not sure how to properly integrate it, since it takes q, k and v as inputs.
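For context, here is a rough sketch of how LocalAttention could be wrapped in a self-attention block that computes the q, k and v projections itself. This assumes the `local-attention` package; the module name `SlidingWindowSelfAttention` and the hyperparameters below are just illustrative, not anything from meshgpt-pytorch:

```python
import torch
from torch import nn
from local_attention import LocalAttention

class SlidingWindowSelfAttention(nn.Module):
    # hypothetical wrapper: project tokens to q/k/v, then apply causal local attention
    def __init__(self, dim, heads = 8, dim_head = 64, window_size = 512, dropout = 0.):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

        self.attn = LocalAttention(
            dim = dim_head,
            window_size = window_size,
            causal = True,          # decoder-style masking
            look_backward = 1,      # each window can also attend to the previous window
            look_forward = 0,
            dropout = dropout,
            autopad = True          # pad so the sequence length need not divide the window
        )

    def forward(self, x, mask = None):
        b, n, _ = x.shape
        h = self.heads

        # (b, n, dim) -> three (b, h, n, dim_head) tensors
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: t.reshape(b, n, h, -1).transpose(1, 2), (q, k, v))

        out = self.attn(q, k, v, mask = mask)   # (b, h, n, dim_head)

        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

# usage
# attn = SlidingWindowSelfAttention(dim = 512, window_size = 512)
# out = attn(torch.randn(1, 4096, 512))   # (1, 4096, 512)
```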

I know @lucidrains has already spent all their allotted time and more on this project, so if I could get some tips I could try to implement it myself.

lucidrains commented 6 months ago

@MarcusLoppe local attention is a good exercise to implement, moderate difficulty for a research engineer. getting kv cache working for bonus points..
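For anyone following along, one way the kv cache could work with a sliding window (just a sketch under my own assumptions, not this repo's actual code) is to keep a rolling cache that is trimmed to the attention span after every decoding step, so memory and per-step compute stay constant instead of growing with the sequence:

```python
import torch
import torch.nn.functional as F

def step_with_rolling_cache(q_new, k_new, v_new, cache, window_size):
    # q_new, k_new, v_new: (batch, heads, 1, dim_head) for the newly decoded token
    # cache: None, or a (k_cache, v_cache) tuple of shape (batch, heads, <=window_size, dim_head)
    if cache is None:
        k_cache, v_cache = k_new, v_new
    else:
        k_cache = torch.cat((cache[0], k_new), dim = -2)
        v_cache = torch.cat((cache[1], v_new), dim = -2)

    # drop keys/values outside the sliding window; the span kept here should match
    # whatever attention pattern was used during training
    # (e.g. roughly window_size * 2 for LocalAttention with look_backward = 1)
    k_cache = k_cache[..., -window_size:, :]
    v_cache = v_cache[..., -window_size:, :]

    # the single new query attends to everything left in the cache,
    # which is all in the past, so no extra causal mask is needed
    out = F.scaled_dot_product_attention(q_new, k_cache, v_cache)
    return out, (k_cache, v_cache)
```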