tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

GPTQ models support #31

Open synacktraa opened 8 months ago

synacktraa commented 8 months ago

Can it handle GPTQ models like the transformers library's AutoModelForCausalLM does?
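For reference, a minimal sketch of what's being asked: loading a GPTQ checkpoint through attention_sinks' drop-in AutoModelForCausalLM, assuming it mirrors the transformers API and that auto-gptq/optimum are installed. The model id and the attention_sinks-specific kwargs below are illustrative, not confirmed by this thread.

```python
# Sketch, not a confirmed recipe: assumes attention_sinks exposes a drop-in
# AutoModelForCausalLM and that the GPTQ backend (auto-gptq/optimum) is installed.
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # example GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # attention_sinks-specific kwargs (values shown are illustrative defaults)
    attention_sink_size=4,
    attention_sink_window_size=1020,
)

inputs = tokenizer("Attention sinks let generation run past the training length because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```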

synacktraa commented 8 months ago

It's working without any problems, but why is the generation speed slow compared to non-quantized models?

tomaarsen commented 8 months ago

Hello!

There shouldn't be any major difference in generation itself, but attention_sinks doesn't support flash attention in any of its models right now. Perhaps that's the difference in generation speed that you're experiencing?
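For comparison, this is roughly how flash attention is requested in plain transformers (recent versions, with the flash-attn package and a supported GPU). Since attention_sinks doesn't support it, the kwarg is shown only to illustrate what's missing; it may be ignored or rejected by attention_sinks models.

```python
# Sketch of requesting flash attention in plain transformers; shown for
# comparison only, since attention_sinks does not support it.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example model with a flash attention implementation
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```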

synacktraa commented 8 months ago

Thanks for the fast response. Do you plan to work on it someday? I can implement it if you can explain flash attention a little bit.

Minami-su commented 6 months ago

It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

synacktraa commented 6 months ago

> It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

Thank you 🙏