microsoft / MInference

To speed up long-context LLM inference, MInference computes attention approximately with dynamic sparse patterns, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License
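For context, a minimal usage sketch of applying MInference to a vLLM engine, assuming the patching API shown in the project README (`MInference("vllm", model_name)` returning a callable that patches a vLLM `LLM`); the model name and parameter values are illustrative:

```python
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # illustrative long-context model
llm = LLM(model_name, enforce_eager=True, max_model_len=128000)

# Patch the vLLM engine so pre-filling uses MInference's dynamic sparse attention.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(
    ["Summarize the following document: ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
```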

add vllm support for 0.4.2 and 0.4.3 #19

Closed · liyucheng09 closed this 2 weeks ago

liyucheng09 commented 2 weeks ago

What does this PR do?

This PR adds vLLM support for versions 0.4.2 and 0.4.3; a sketch of the version gating follows below.

Fixes #13.
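A minimal sketch of how per-version support might be gated; the helpers `patch_vllm_042` / `patch_vllm_043` are hypothetical placeholders, not MInference's actual internals:

```python
import vllm
from packaging.version import Version

def patch_vllm_042(llm):
    # Placeholder: swap in sparse attention for vLLM 0.4.2's layer layout.
    return llm

def patch_vllm_043(llm):
    # Placeholder: 0.4.3 reorganized attention internals, so it needs its own path.
    return llm

def apply_vllm_patch(llm):
    """Dispatch to a version-specific patch based on the installed vLLM."""
    v = Version(vllm.__version__)
    if Version("0.4.2") <= v < Version("0.4.3"):
        return patch_vllm_042(llm)
    if Version("0.4.3") <= v < Version("0.4.4"):
        return patch_vllm_043(llm)
    raise NotImplementedError(f"vLLM {vllm.__version__} is not supported")
```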

Before submitting

Who can review?

@iofu728