microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
MIT License

Release Plan of BitBLAS 0.0.1 #150

Open LeiWang1999 opened 2 months ago

LeiWang1999 commented 2 months ago

Hi all, it's time for us to consider the official release of BitBLAS v0.0.1. Here are some todo items before this release:

LeiWang1999 commented 2 months ago

Looking ahead, our plan for v0.0.2 should include at least support for the Marlin template, quantized Flash Attention, and Group MoE :)

LeiWang1999 commented 2 months ago

PR #153 serialized the kernel name with the operator config and hint.
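To illustrate the idea (this is my own sketch, not the actual BitBLAS implementation — the config fields and helper names are assumptions), serializing a kernel name from an operator config and a hint can look like flattening the config's fields into a deterministic string:

```python
from dataclasses import dataclass, fields

# Hypothetical operator config; field names are illustrative only.
@dataclass
class MatmulConfig:
    M: int = 1
    N: int = 4096
    K: int = 4096
    in_dtype: str = "float16"
    weight_dtype: str = "int4"

def serialize_kernel_name(prefix: str, config: MatmulConfig, hint: str) -> str:
    """Flatten the config fields and a scheduling hint into a unique kernel name."""
    parts = [f"{f.name}_{getattr(config, f.name)}" for f in fields(config)]
    return "_".join([prefix] + parts + [hint])

name = serialize_kernel_name("matmul", MatmulConfig(), "tc")
# -> "matmul_M_1_N_4096_K_4096_in_dtype_float16_weight_dtype_int4_tc"
```

A deterministic name like this lets cached or pre-compiled kernels be looked up again from the same config and hint.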

LeiWang1999 commented 2 months ago

From a policy perspective, I think we should currently use LOP3 only for weight propagation. This approach is compatible not only with A100 devices but also with other common devices, such as SM 70 or AMD (even though it's not currently implemented for AMD, it could be).
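For context, NVIDIA's `lop3.b32` PTX instruction computes an arbitrary three-input bitwise function of its operands, selected by an 8-bit immediate lookup table; it is commonly used to fuse the bit manipulations of low-bit weight dequantization into few instructions. A minimal Python emulation of its semantics (my own sketch for illustration, not BitBLAS code):

```python
def lop3(a: int, b: int, c: int, imm_lut: int) -> int:
    """Emulate PTX lop3.b32: for each bit position, the result bit is the
    imm_lut bit indexed by (a_bit << 2) | (b_bit << 1) | c_bit."""
    result = 0
    for i in range(32):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm_lut >> idx) & 1) << i
    return result

# By convention the immediate is derived by applying the desired operation to
# the constants 0xF0 (for a), 0xCC (for b), 0xAA (for c).
# For the function (a & b) | c:
IMM = (0xF0 & 0xCC) | 0xAA  # = 0xEA
out = lop3(0xDEADBEEF, 0x0000FFFF, 0xFF000000, IMM)
assert out == ((0xDEADBEEF & 0x0000FFFF) | 0xFF000000)
```

Because one `lop3` replaces a chain of AND/OR/XOR instructions, restricting it to weight propagation keeps the hot dequantization path fast while staying portable to architectures (SM 70, AMD) that would otherwise need a fallback.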

For Stage3 performance, we can provide an option to enable it.

Moreover, the incoming stream_k template should share the same weight transformation function as Stage3.
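For reference, the Stream-K scheme flattens the grid of output tiles times their per-tile K iterations into one linear work queue and splits it evenly across CTAs, so a CTA may finish one tile's tail and begin the next tile's head; tiles touched by several CTAs are then combined in a fix-up reduction. A rough sketch of the partitioning logic (all names are my own, not the template's API):

```python
def stream_k_partition(num_tiles: int, iters_per_tile: int, num_ctas: int):
    """Assign each CTA a contiguous slice of the flattened (tile, k-iter)
    work queue of size num_tiles * iters_per_tile.
    Returns a list of per-CTA (start_iter, end_iter) half-open ranges."""
    total = num_tiles * iters_per_tile
    base, rem = divmod(total, num_ctas)
    slices, start = [], 0
    for cta in range(num_ctas):
        length = base + (1 if cta < rem else 0)  # spread the remainder
        slices.append((start, start + length))
        start += length
    return slices

# 4 output tiles x 8 k-iterations split over 3 CTAs: slices cross tile
# boundaries, so those tiles need a cross-CTA reduction (fix-up) step.
parts = stream_k_partition(num_tiles=4, iters_per_tile=8, num_ctas=3)
# parts == [(0, 11), (11, 22), (22, 32)]
```

Since every CTA consumes K iterations of (possibly transformed) weight data regardless of tile boundaries, it is natural for the stream_k template to reuse the Stage3 weight transformation rather than maintain a second layout.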