Hi @yawzhe,
Thank you for your interest in MInference.
After reviewing the code, it appears that our decoding stage relies on flash-attn. For the prefilling stage, the three ops used are based on a Triton implementation of dynamic sparse FlashAttention, which does not depend on the flash-attn library. We plan to support a version that does not rely on flash-attn in the future, although that version will have slightly higher latency than the flash-attn-based one.
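For context, here is a minimal sketch of how an application could probe for the optional flash-attn dependency at runtime; this check is illustrative and not part of MInference itself:

```python
# Illustrative probe: flash-attn is an optional dependency used by the
# decoding path, so detect it at import time and record the result.
try:
    import flash_attn  # noqa: F401  (only testing availability)
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

print("flash-attn available:", HAS_FLASH_ATTN)
```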
Please upgrade MInference to version 0.1.4.post3, which does not depend on flash_attn. You can do this by running the following command:
pip install minference==0.1.4.post3
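After upgrading, a typical usage sketch, based on the pattern shown in the MInference README (the model checkpoint below is only an example; substitute the one you actually use):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

# Example long-context model; any supported checkpoint works here.
model_name = "gradientai/Llama-3-8B-Instruct-262k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Patch the model so its attention uses MInference's dynamic sparse kernels.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)
```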
Describe the issue
Does the A6000 support accelerated inference? We do not have an A100 at the moment; which other server GPU models would work?