microsoft / MInference

To speed up long-context LLM inference, MInference computes attention approximately with dynamic sparse patterns, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License
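For context, a minimal usage sketch of applying MInference to a vLLM engine, assuming the patching API shown in the project README (`MInference("vllm", model_name)` returning a callable that patches a vLLM `LLM`); the model name and parameter values are illustrative:

```python
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # illustrative long-context model
llm = LLM(model_name, enforce_eager=True, max_model_len=128000)

# Patch the vLLM engine so pre-filling uses MInference's dynamic sparse attention.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(
    ["Summarize the following document: ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
```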

add vllm support for 0.4.2 and 0.4.3 #19

Closed · liyucheng09 closed this 2 weeks ago

liyucheng09 commented 2 weeks ago

What does this PR do?

This PR adds vLLM support for versions 0.4.2 and 0.4.3; a sketch of the version gating follows below.

Fixes #13.
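A minimal sketch of how per-version support might be gated; the helpers `patch_vllm_042` / `patch_vllm_043` are hypothetical placeholders, not MInference's actual internals:

```python
import vllm
from packaging.version import Version

def patch_vllm_042(llm):
    # Placeholder: swap in sparse attention for vLLM 0.4.2's layer layout.
    return llm

def patch_vllm_043(llm):
    # Placeholder: 0.4.3 reorganized attention internals, so it needs its own path.
    return llm

def apply_vllm_patch(llm):
    """Dispatch to a version-specific patch based on the installed vLLM."""
    v = Version(vllm.__version__)
    if Version("0.4.2") <= v < Version("0.4.3"):
        return patch_vllm_042(llm)
    if Version("0.4.3") <= v < Version("0.4.4"):
        return patch_vllm_043(llm)
    raise NotImplementedError(f"vLLM {vllm.__version__} is not supported")
```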

Before submitting

Who can review?

@iofu728