Closed LucasWilkinson closed 2 months ago
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the `fastcheck` CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your `fast-check` build in the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
🚀
Notes
This PR is a work in progress and based off of: https://github.com/vllm-project/vllm/pull/6396 so that will have to land before this.
Description
This PR introduces a spiritual successor to the Marlin kernel, optimized for Hopper architectures and based on CUTLASS.
Motivation
The motivation for this kernel is multifold:
1) Marlin (v1) uses `mma` instructions, which are the fastest tensor core instructions available on Ampere. With Hopper, Nvidia released a set of new `wgmma` instructions, which are required to hit the peak FLOPs reported by Nvidia; without them, i.e. using `mma` instructions, you can expect to achieve at best ~75% of peak [1, 2].
2) Marlin (v1) uses a specific weight storage layout that is specialized for the `mma` instructions. We want to adopt a more flexible/dynamic way of defining these layouts so we can accommodate new instructions more rapidly, i.e. `wgmma` and any new instructions Blackwell introduces.
TODO:
Current Performance
Float16
BFloat16