[WIP, Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel

Notes

This PR is a work in progress and based off of: https://github.com/vllm-project/vllm/pull/6396 so that will have to land before this.

Description

This PR introduces a spiritual successor to the Marlin kernel but optimized for Hopper architectures and based off of cutlass.

Motivation

The motivation for this kernel is multifold:

1) Marlin (v1) uses mma instructions, which are fastest tensor core instructions available on Ampere but with Hopper Nvidia release a set of new wgmma instructions which are required to hit the peak FLOPs reported by Nvidia, without them i.e. using mma instructions you can expect to achieve at best ~75% of peak [1, 2] 2) Marlin (v1) uses a specific weight storage layout that is specialized for the mma instructions, we want to adopt a more flexible/dynamic way of defining these layouts so we can accommodate new instructions more rapidly, i.e. wgmma and new instructions Blackwell introduces if any

MarlinV2 achieves this by describing the weight storage scheme using cutlass and CUTE 3) Marlin (v1) does not support cutlass epilogues, we eventually plan to investigate subbyte weight quantization + activation quantization, for activation quantization we'd like to leverage the great work done by @tlrmchlsmth @varun-sundar-rabindranath and @ProExpertProg to write custom cutlass epilogues for fp8 and int8

TODO:

[x] Chose a new name (candidates: wahoo, swordfish (kinda cutlass + marlin), non-fish names ...): edit: chose machete
[x] Improve heuristic namely for 4096x4096: resolved by moving heuristic into the C++ code
[ ] Improve BFloat16 performance (via bit shift or interleaving) (future PR)
[ ] E2E integration (future PR)
[ ] Improve batch size < 32 performance (potentially a future PR, likely through improving the stream-k scheduler) (future PR)
[ ] Investigate fp8 activation support (future PR)

neuralmagic / nm-vllm