microsoft / MInference
To speed up long-context LLM inference, MInference approximates the attention with dynamic sparse computation, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License
723 stars · 29 forks
PreRelease: v0.1.0 #2
Closed: iofu728 closed this issue 3 months ago
iofu728 commented 3 months ago
release the MInference pip package;
add experiment details;
add examples (two illustrative sketches follow this list);
add a GitHub Action.
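For the pip package and examples items, here is a minimal sketch of how the released package might be used, following the patch-style API described in the project README; the exact import path, class name, and arguments are assumptions, so check https://aka.ms/MInference for the final interface.

```python
# Sketch of expected usage after `pip install minference` (API assumed from the README).
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference  # assumed import path

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # hypothetical long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Patch the model so pre-filling runs MInference's dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

prompt = "..."  # a long-context prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```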
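The repository description above mentions approximating attention with dynamic sparse computation during pre-filling. The following is a toy NumPy sketch of that general idea, not the MInference implementation: cheaply estimate which key blocks matter for each query block, then compute exact attention only on the selected blocks.

```python
# Toy dynamic block-sparse attention (illustrative only, not MInference's kernels).
import numpy as np

def dynamic_block_sparse_attention(q, k, v, block=4, topk=2):
    """q, k, v: (seq_len, dim) arrays; seq_len must be divisible by block."""
    n, d = q.shape
    nb = n // block
    # Cheap importance estimate: scores between mean-pooled query/key blocks.
    qb = q.reshape(nb, block, d).mean(axis=1)  # (nb, d)
    kb = k.reshape(nb, block, d).mean(axis=1)  # (nb, d)
    est = qb @ kb.T                            # (nb, nb) block-level scores
    out = np.zeros_like(q)
    for i in range(nb):
        # Keep only the top-k most promising key blocks for this query block.
        keep = np.argsort(est[i])[-topk:]
        idx = np.concatenate([np.arange(j * block, (j + 1) * block) for j in keep])
        qi = q[i * block:(i + 1) * block]
        scores = qi @ k[idx].T / np.sqrt(d)    # exact scores on kept blocks only
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = w @ v[idx]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(dynamic_block_sparse_attention(q, k, v).shape)  # (16, 8)
```

The cost saving comes from the inner loop touching only `topk` key blocks per query block instead of all of them, which is where the pre-filling speedup in the description originates.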