synacktraa opened this issue 8 months ago
It's working without any problem, but why is the generation speed slow compared to non-quantized models?
Hello!
There shouldn't be any major changes in generation, but attention_sinks doesn't support flash attention in any of its models right now. Perhaps that's the difference in generation speed that you're experiencing?
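For reference, this is roughly how flash attention is usually enabled in plain transformers (a sketch, assuming a recent transformers version with the `attn_implementation` argument and flash-attn installed; the model name is just an example). This is the path attention_sinks doesn't hook into yet:

```python
# Sketch: enabling flash attention in plain transformers (not via attention_sinks).
# Assumes transformers >= 4.36 and the flash-attn package are installed.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # example; any FA2-capable model
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # the flash attention code path
)
```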
Thanks for the fast response. Do you plan to work on it someday? I can implement it if you can explain flash attention a little bit.
It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa
Thank you 🙏
Can it handle GPTQ models like the transformers library's `AutoModelForCausalLM` does?
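Something along these lines is what I have in mind (a sketch, assuming the attention_sinks wrapper mirrors the transformers API; the GPTQ checkpoint name is just an example and the sink kwargs are taken from the attention_sinks README, GPTQ support itself is the open question):

```python
# Sketch: loading a GPTQ-quantized checkpoint through the attention_sinks
# drop-in wrapper, the same way one would with transformers' AutoModelForCausalLM.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GPTQ",   # example GPTQ checkpoint (hypothetical choice)
    device_map="auto",
    attention_sink_size=4,             # initial "sink" tokens that are always kept
    attention_sink_window_size=1020,   # sliding window over the most recent tokens
)
```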