Open foreverpiano opened 2 months ago
Hi @foreverpiano, thank you for your interest in MInference.
We have released the GPU kernel in the library. You can follow the startup guide to use it.
We also report end-to-end speedup and micro-benchmark results at https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments and in Appendix D.2.
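For a rough sense of what a sparse pattern can save before looking at measured numbers, here is a back-of-the-envelope sketch. It is not the MInference kernel and not a benchmark. It only counts how many (query, key) pairs a dense causal mask evaluates versus an assumed "sink plus local window" pattern. The function names and the `sink` and `window` parameters are hypothetical, chosen just for illustration.

```python
def causal_pairs(n: int) -> int:
    """(query, key) pairs evaluated by dense causal attention over n tokens."""
    return n * (n + 1) // 2


def sink_window_pairs(n: int, sink: int, window: int) -> int:
    """Pairs evaluated when each query sees the first `sink` keys plus its
    `window` most recent keys (itself included), still under a causal mask.

    Assumed illustrative pattern, not taken from the MInference code.
    """
    total = 0
    for q in range(n):
        visible = q + 1  # causal mask: keys 0..q
        # If the sink and window regions overlap or touch, the query
        # effectively sees every visible key; otherwise sink + window keys.
        total += min(visible, sink + window)
    return total


def theoretical_speedup(n: int, sink: int, window: int) -> float:
    """Ratio of dense causal work to sparse work (an upper bound only)."""
    return causal_pairs(n) / sink_window_pairs(n, sink, window)


if __name__ == "__main__":
    # Hypothetical settings: 128K context, 128 sink tokens, 4096-token window.
    n, sink, window = 131072, 128, 4096
    print(f"pair-count ratio: {theoretical_speedup(n, sink, window):.1f}x")
```

The ratio is only an upper bound on the attainable speedup. A real kernel also pays for building the sparse index, for irregular memory access, and for lower hardware utilization, so measured results such as those linked above are the numbers to rely on.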
Describe the issue
The pattern is good, but I wonder whether there is a hardware-efficient kernel for these patterns. Have you tested the speedup of this sparse SDPA attention kernel compared to the original causal attention?