LeiWang1999 opened 2 months ago
Looking ahead, our plan for v0.0.2 should include at least support for the Marlin template, quantized Flash Attention, and Group MoE :)
PR #153 serialized the kernel name with the operator config and hint.
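For context, here is a minimal sketch of what such a serialization could look like; `serialize_kernel_name`, `MatmulConfig`, and the field layout are illustrative assumptions, not the actual PR #153 code:

```python
# Hypothetical sketch: build a unique, parseable kernel name from the
# operator name, its config, and the scheduling hint. Names are
# illustrative only, not BitBLAS's API.
from dataclasses import dataclass


@dataclass
class MatmulConfig:
    M: int
    N: int
    K: int
    in_dtype: str
    out_dtype: str


def serialize_kernel_name(op: str, config: MatmulConfig, hint: str) -> str:
    # Join the operator name, shape/dtype config, and hint with "_"
    # so the name can later be parsed back into its components.
    fields = [op, str(config.M), str(config.N), str(config.K),
              config.in_dtype, config.out_dtype, hint]
    return "_".join(fields)


print(serialize_kernel_name(
    "matmul", MatmulConfig(16384, 16384, 16384, "int4", "float16"),
    "block128x256"))
# -> matmul_16384_16384_16384_int4_float16_block128x256
```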
From a policy perspective, I think we should currently use LOP3 only for weight propagation. This approach is compatible not only with A100 devices but also with other common devices, such as SM 70 or AMD GPUs (it's not currently implemented for AMD, but it could be).
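As background, a small Python emulation of the `lop3.b32` semantics (assuming the standard PTX ISA definition): it evaluates an arbitrary 3-input bitwise function selected by an 8-bit immediate lookup table, which is what makes it handy for weight-layout bit tricks.

```python
# Sketch of PTX lop3.b32 semantics (assumed from the PTX ISA docs):
# the three input bits at each position index into an 8-bit LUT.
def lop3(a: int, b: int, c: int, imm_lut: int) -> int:
    out = 0
    for i in range(32):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((imm_lut >> idx) & 1) << i
    return out


# The immediate is derived by applying the target op to the constants
# 0xF0 (a), 0xCC (b), 0xAA (c); e.g. (a & b) ^ c -> (0xF0 & 0xCC) ^ 0xAA.
imm = (0xF0 & 0xCC) ^ 0xAA  # 0x6A
assert lop3(0xFFFF0000, 0x00FF00FF, 0x0F0F0F0F, imm) == \
       ((0xFFFF0000 & 0x00FF00FF) ^ 0x0F0F0F0F)
```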
For Stage3 performance, we can provide an option to enable it.
Moreover, the upcoming stream_k template should share the same weight transformation function as Stage3; a rough sketch of both points follows.
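A minimal sketch of the idea, with entirely hypothetical names (`transform_weight`, `build_stage3_kernel`, `build_stream_k_kernel`, and `enable_stage3` are illustrative, not BitBLAS APIs): both templates call one shared transform, and the Stage3 path is gated behind an explicit option.

```python
# Hypothetical sketch: one shared weight-transformation helper reused by
# both the Stage3 and stream_k templates, so they agree on weight layout.
import numpy as np


def transform_weight(weight: np.ndarray, tile: int = 16) -> np.ndarray:
    # Reorder an (N, K) weight matrix into (N//tile, K//tile, tile, tile)
    # blocks so the kernel reads contiguous tiles; a stand-in for the
    # real layout propagation.
    n, k = weight.shape
    return (weight.reshape(n // tile, tile, k // tile, tile)
                  .transpose(0, 2, 1, 3).copy())


def build_stage3_kernel(weight: np.ndarray, enable_stage3: bool = True):
    # Stage3 is optional, toggled by an explicit flag.
    w = transform_weight(weight) if enable_stage3 else weight
    ...  # emit the Stage3 template with `w`


def build_stream_k_kernel(weight: np.ndarray):
    # stream_k reuses the exact same transformation function.
    w = transform_weight(weight)
    ...  # emit the stream_k template with `w`
```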
Hi all, it's time for us to consider the official release of BitBLAS v0.0.1. Here are some TODO items before this release: