Closed jwfromm closed 3 weeks ago
Name | Link |
---|---|
Latest commit | d5db96230b0db325d72e7a6f89ef22c1055bc159 |
Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66c3d09495815f0008d169be |
Deploy Preview | https://deploy-preview-3008--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D61408771
This pull request has been merged in pytorch/FBGEMM@162cc69b133797b213664e41c9923d96593d1fc3.
Summary: This diff does quite a bit of facelifting to our Marlin BF16 x I4 kernels. These improvements include:

- the `torch.ops.marlin.marlin_gemm` op
- convenient helpers for quantizing to the marlin format, such as `marlin_quantize`

One downside of this work is that we have diverged a bit from VLLM, so it may be harder to stay in sync going forward. However, I think the benefits of the improvements in this diff outweigh the potential sync costs.
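For context on what a helper like `marlin_quantize` conceptually does, here is a minimal sketch of symmetric per-group int4 quantization, the general scheme that BF16 x I4 kernels consume. This is purely illustrative: the function names, group size, and layout here are assumptions, not the actual marlin format or FBGEMM API.

```python
# Illustrative sketch only -- NOT the real marlin_quantize.
# Symmetric per-group int4 quantization: each group of values shares
# one scale, and each value is rounded to a signed int4 in [-8, 7].

def quantize_int4(values, group_size=8):
    """Quantize a list of floats to int4 codes plus one scale per group."""
    qvals, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # Scale maps the largest magnitude in the group to 7 (max int4).
        scale = max(abs(v) for v in group) / 7.0 or 1.0
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(v / scale))) for v in group)
    return qvals, scales

def dequantize_int4(qvals, scales, group_size=8):
    """Recover approximate floats by rescaling each int4 code."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]
```

A BF16 x I4 GEMM kernel fuses the dequantization step into the matrix multiply, reading the packed int4 weights and per-group scales directly; the round-trip error of the scheme above is bounded by half a quantization step per value.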
Differential Revision: D61408771