sustcsonglin / flash-linear-attention

Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
MIT License

Hello from HF Diffusers #46

Open · sayakpaul opened this issue 1 month ago

sayakpaul commented 1 month ago

Thanks for the incredibly clean repository!

I am Sayak from the Diffusers team at Hugging Face. My question is probably very naive, so I apologize for that in advance.

I wanted to know if linear attention could be applied at inference time only. More precisely, can I take a model trained with regular attention and turn it into a linear attention model during inference?

sustcsonglin commented 1 month ago

Hello Sayak,

Thanks for your interest! Unfortunately, we cannot directly convert a model with softmax attention to one with linear attention during inference without any additional training. However, it is indeed possible to finetune pretrained LLMs for a few steps—much fewer than training from scratch—to switch from regular attention to linear attention. You can refer to these resources for more details: arXiv:2405.06640, OpenReview, etc.
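
For intuition, here is a minimal, non-causal single-head sketch (random tensors, an `elu + 1` feature map, none of this repo's actual kernels) showing that the two parameterizations compute different functions, which is why the weights cannot simply be reused without some adaptation:

```python
import torch
import torch.nn.functional as F

# Toy single-head, non-causal attention over random tensors.
B, T, D = 1, 4, 8
q, k, v = (torch.randn(B, T, D) for _ in range(3))

# Softmax attention: O = softmax(QK^T / sqrt(D)) V
softmax_out = F.softmax(q @ k.transpose(-1, -2) / D ** 0.5, dim=-1) @ v

# Linear attention: the softmax is replaced by a feature map phi,
# O = phi(Q) (phi(K)^T V) / (phi(Q) sum_t phi(K_t)), avoiding the T x T matrix.
def phi(x):
    return F.elu(x) + 1.0

num = phi(q) @ (phi(k).transpose(-1, -2) @ v)                       # (B, T, D)
den = phi(q) @ phi(k).sum(dim=1, keepdim=True).transpose(-1, -2)    # (B, T, 1)
linear_out = num / den

# The outputs differ, so a checkpoint trained with softmax attention does not
# compute the same function when its weights are plugged into linear attention.
print((softmax_out - linear_out).abs().max())
```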

yzhangcs commented 1 month ago

@sayakpaul FYI, we have released some weights converted from Mistral-7B-v0.1 as in arXiv:2405.06640. You can try them by loading fla-hub/gla-7B-mistral-20B, fla-hub/gsa-7B-mistral-20B, or fla-hub/gsa-7B-mistral-100B.
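
A minimal loading sketch, assuming you have `transformers` and this repo's `fla` package installed (importing `fla` is what registers the GLA/GSA architectures with the Auto classes; adjust dtype and device to your setup):

```python
import fla  # noqa: F401  -- registers the GLA/GSA model classes with transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "fla-hub/gla-7B-mistral-20B"  # or gsa-7B-mistral-20B / gsa-7B-mistral-100B
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda().eval()

inputs = tokenizer("Linear attention is", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```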

sayakpaul commented 1 month ago

Would there be any interest in trying something similar for diffusion models? I am happy to team up. I believe this would be a very significant contribution to the community now that large models like Flux are emerging.