This is something we are actively working on supporting end-to-end.
In vLLM, we currently support 2:4 sparsity with W4A16 and W8A16. We still need to add inference kernels to support W8A8 FP8 with 2:4 sparsity, and we are collaborating with the CUTLASS team on this.
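As a rough illustration of what the existing support covers, a 2:4 sparse checkpoint quantized to W4A16 or W8A16 can be served through vLLM's standard offline API; the sparsity/quantization scheme is picked up from the checkpoint's own config rather than via extra flags. This is a minimal sketch only, and the model identifier below is a hypothetical placeholder rather than a published checkpoint.

```python
from vllm import LLM, SamplingParams

# Hypothetical placeholder: substitute a real 2:4 sparse W4A16 or W8A16
# checkpoint that you have produced or downloaded.
MODEL_ID = "your-org/llama-2of4-sparse-w4a16"

# For supported schemes, vLLM reads the quantization/sparsity configuration
# from the checkpoint metadata, so loading looks the same as for a dense model.
llm = LLM(model=MODEL_ID)

outputs = llm.generate(
    ["Explain 2:4 structured sparsity in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```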