microsoft / MInference

To speed up long-context LLM inference, MInference applies approximate and dynamic sparse attention, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy (see the usage sketch below).
https://aka.ms/MInference
MIT License
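For context on what the issue refers to, below is a minimal sketch of how a Hugging Face model can be patched with MInference, loosely following the quick-start pattern in the project README. The model name, the `"minference"` attention type string, and the generation call are assumptions for illustration and may differ from the installed version's exact API.

```python
# Hypothetical sketch: patching a Hugging Face model with MInference's
# dynamic sparse attention. Names and arguments follow the README-style
# quick start and are assumptions, not a verified reference.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # assumed long-context model
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply the MInference patch so pre-filling uses dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Generation then proceeds as with any Hugging Face causal LM.
inputs = tokenizer("Summarize the following document: ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```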

[Bug]: Does MInference require flash-attention? We want to accelerate inference, but our A6000 server does not support flash-attention #24

Closed yawzhe closed 2 weeks ago

yawzhe commented 2 weeks ago

Describe the bug

No response

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

iofu728 commented 2 weeks ago

This issue is being closed as a duplicate of #23.