microsoft / MInference

To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License
571 stars · 20 forks
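
For context on what the description above refers to, here is a minimal usage sketch, assuming the patching-style API shown in the project README: MInference wraps an existing Hugging Face model so that pre-filling attention runs through its approximate, dynamic sparse kernels. The model name and prompt are illustrative placeholders; check the README for the currently supported models.

```python
# Minimal sketch, assuming the MInference patching API from the README.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

# Illustrative long-context model; substitute any model MInference supports.
model_name = "gradientai/Llama-3-8B-Instruct-262k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Patch the model so pre-filling attention uses the approximate,
# dynamic sparse attention kernels instead of dense attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Generation proceeds as usual; the latency reduction applies to pre-filling.
inputs = tokenizer("Summarize the following document: ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```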

Doc(MInference): update paper information #3

Closed · iofu728 closed 3 weeks ago

iofu728 commented 3 weeks ago