vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

When will w8a8 FP8 quantization combined with 2:4 sparsity be supported in llm-compressor, and when can such models run on vLLM? #148

Open leoyuppieqnew opened 2 months ago

robertgshaw2-neuralmagic commented 2 months ago

This is something we are actively working on supporting end-to-end.

In vLLM, we currently support 2:4 sparsity with w4a16 and w8a16. We still need to add inference kernels to support w8a8 FP8 combined with 2:4 sparsity, and we are collaborating with the CUTLASS team on this.
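For the combinations that are already supported, the flow is to produce a 2:4 sparse, quantized checkpoint with llm-compressor and then load it in vLLM. The sketch below is a minimal example of that flow, assuming the `SparseGPTModifier` / `GPTQModifier` recipe-style API and the `oneshot` entry point; the model name, output directory, and exact parameter names are illustrative and may differ between library versions.

```python
# Sketch: one-shot 2:4 pruning + w4a16 quantization with llm-compressor,
# then inference with vLLM. Module paths and parameters are assumptions
# based on the library's documented recipe API and may need adjustment.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # Prune Linear layers to a 2:4 (50%) structured sparsity pattern.
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4", targets=["Linear"], ignore=["lm_head"]),
    # Quantize the remaining weights to 4-bit with 16-bit activations (w4a16).
    GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    dataset="open_platypus",                      # calibration dataset
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-2of4-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Load the compressed checkpoint in vLLM for inference.
from vllm import LLM, SamplingParams

llm = LLM(model="Meta-Llama-3-8B-Instruct-2of4-W4A16")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

Swapping the `GPTQModifier` scheme to an FP8 w8a8 scheme would follow the same recipe structure once the corresponding 2:4 FP8 kernels land in vLLM.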

markurtz commented 3 weeks ago

A quick update: We hope to have end-to-end support and a model launch within the next few weeks!