Open sayakpaul opened 1 month ago
Hello Sayak,
Thanks for your interest! Unfortunately, we cannot directly convert a model with softmax attention to one with linear attention during inference without any additional training. However, it is indeed possible to finetune pretrained LLMs for a few steps—much fewer than training from scratch—to switch from regular attention to linear attention. You can refer to these resources for more details: arXiv:2405.06640, OpenReview, etc.
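To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from this repo) contrasting softmax attention with kernel-based linear attention. The feature map `elu(x) + 1` is one common choice from the linear-attention literature; the key point is that the two mechanisms compute different functions, so swapping one for the other at inference requires the finetuning described above.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: softmax(QK^T / sqrt(d)) V, O(n^2) in sequence length.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feature_map(x):
    # elu(x) + 1: keeps features positive so the normalizer is well-defined.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Replace softmax(QK^T) with phi(Q) phi(K)^T and reassociate:
    # (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V), which is O(n) in n.
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                # (d, d_v) summary of all keys/values
    z = Qf @ Kf.sum(axis=0)      # per-query normalizer, shape (n,)
    return (Qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Both produce (n, d) outputs, but they are NOT equal in general,
# which is why a softmax-trained model cannot simply be swapped
# to linear attention at inference time.
print(np.allclose(softmax_attention(Q, K, V), linear_attention(Q, K, V)))
```

The reassociation in `linear_attention` is what removes the quadratic cost; the finetuning recipe in arXiv:2405.06640 is about teaching the pretrained weights to work well under this changed mechanism.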
@sayakpaul FYI, we have released some weights converted from Mistral-7B-v0.1 as in arXiv:2405.06640. You can give them a try by loading fla-hub/gla-7B-mistral-20B, fla-hub/gsa-7B-mistral-20B, or fla-hub/gsa-7B-mistral-100B.
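A sketch of loading one of these checkpoints with Hugging Face transformers. This assumes the flash-linear-attention package is installed so the GLA/GSA architectures are registered with `AutoModel`; check the repo's README for the exact requirements.

```python
# Converted checkpoints mentioned above (Mistral-7B-v0.1 -> linear attention).
REPOS = [
    "fla-hub/gla-7B-mistral-20B",
    "fla-hub/gsa-7B-mistral-20B",
    "fla-hub/gsa-7B-mistral-100B",
]

def load(repo_id: str):
    """Download and return (tokenizer, model) for one converted checkpoint.

    Imports are kept inside the function so listing the repo ids does not
    require transformers to be installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tok, model

if __name__ == "__main__":
    # Note: these are 7B-parameter models, so this downloads tens of GB.
    tok, model = load(REPOS[0])
    print(model.config.model_type)
```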
Thanks for the incredibly clean repository!
I am Sayak from the Diffusers team at Hugging Face. My question is probably very naive, so I apologize for that in advance.
I wanted to know if linear attention can be applied at inference time only. More precisely, can I take a model trained with regular attention and turn it into a linear attention model during inference?