microsoft / MInference

To speed up long-context LLMs' inference, MInference computes attention with approximate and dynamic sparse methods, reducing pre-filling inference latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: Is A6000 supported? #23

Closed: yawzhe closed this issue 1 week ago

yawzhe commented 2 weeks ago

Describe the issue

Does the A6000 support accelerated inference? We don't have an A100 for now; which other server GPU models would work?

iofu728 commented 2 weeks ago

Hi @yawzhe,

Thank you for your interest in MInference.

After reviewing the code, it appears that our decoding stage relies on flash-attn. The three ops used in the prefilling stage, however, are based on a Triton implementation of dynamic sparse flash attention and do not depend on the flash-attn library. We plan to support a flash-attn-free version in the future, although it will have slightly higher latency than the flash-attn path.
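To illustrate the split described above, here is a minimal sketch (not MInference's actual code; `decode_attention` and the fallback choice are hypothetical): decoding tries flash-attn when it is installed and otherwise falls back to PyTorch's scaled_dot_product_attention, which works without flash-attn at slightly higher latency.

```python
import torch.nn.functional as F

try:
    # flash-attn is optional in this sketch; only the decoding path benefits from it.
    from flash_attn import flash_attn_func
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

def decode_attention(q, k, v):
    # q, k, v: (batch, seqlen, num_heads, head_dim), fp16/bf16 on GPU.
    if HAS_FLASH_ATTN:
        # Fast path: fused flash-attn kernel.
        return flash_attn_func(q, k, v, causal=True)
    # Fallback path: PyTorch SDPA expects (batch, num_heads, seqlen, head_dim),
    # so transpose in and out. No flash-attn dependency, slightly slower.
    q_, k_, v_ = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q_, k_, v_, is_causal=True)
    return out.transpose(1, 2)
```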

iofu728 commented 1 week ago

Please upgrade MInference to version 0.1.4.post3, which does not depend on flash_attn. You can do this by running the following command:

pip install minference==0.1.4.post3
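After installing, applying the patched attention to a Hugging Face model typically looks like the sketch below. This follows the usage pattern documented by the project as a rough guide; the exact `attn_type` string and arguments may differ between versions, and `model_name` is only a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder; use a supported long-context model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Patch the model's attention with MInference's dynamic sparse kernels.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)
```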