microsoft / MInference

[NeurIPS'24 Spotlight] MInference speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: What is the speedup of the attention kernel in the current implementation? #73

Open foreverpiano opened 2 months ago

foreverpiano commented 2 months ago

Describe the issue

[image] The pattern looks good, but I wonder whether there is a hardware-efficient kernel for these patterns. Have you tested the speedup of this sparse SDPA attention kernel compared to the original causal attention?

iofu728 commented 2 months ago

Hi @foreverpiano, thank you for your interest in MInference.

We have released the GPU kernels in the library. You can follow the quick start guide to use them.
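For reference, the patching flow from the quick start looks roughly like the sketch below. The exact `MInference` signature (the `"minference"` attention-type string and `model_name` argument) is based on my reading of the README and should be checked against the current docs; the model name is only an example.

```python
from transformers import pipeline
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # example long-context model
pipe = pipeline("text-generation", model=model_name, torch_dtype="auto", device_map="auto")

# Patch the HF model so pre-filling runs through MInference's dynamic sparse attention kernels
minference_patch = MInference("minference", model_name)
pipe.model = minference_patch(pipe.model)

print(pipe("Summarize the following document: ...", max_new_tokens=32))
```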

We have also reported end-to-end speedup and micro-benchmark results in https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments and in Appendix D.2 of the paper.
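If you want a rough micro-benchmark of your own, a minimal sketch is below. It times dense causal SDPA with CUDA events as the baseline; `sparse_attention` is a hypothetical placeholder for whichever sparse kernel you want to compare, not a real MInference function name, so swap in the actual call before running.

```python
import torch
import torch.nn.functional as F

def time_fn(fn, *args, warmup=3, iters=10):
    """Average per-call latency in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def dense_causal(q, k, v):
    # Dense causal attention baseline (PyTorch's fused SDPA)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Long-context pre-filling shape: batch=1, heads=32, seq_len=128k, head_dim=128
q = torch.randn(1, 32, 131072, 128, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

dense_ms = time_fn(dense_causal, q, k, v)
# sparse_ms = time_fn(sparse_attention, q, k, v)  # placeholder: plug in the sparse kernel here
print(f"dense causal SDPA: {dense_ms:.2f} ms/call")
```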