vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

[Usage] Can I do GPTQ with FP8 KV cache scheme? #137

Open CharlesRiggins opened 2 months ago

CharlesRiggins commented 2 months ago

I want to quantize the KV cache to FP8 E4M3 on top of GPTQ. Is it possible to do it with llm-compressor?
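For context, here is a rough sketch of the kind of recipe I have in mind. I am not sure whether `GPTQModifier` accepts a `kv_cache_scheme` entry like this, or whether a separate quantization stage is required; that is exactly what I am asking. The recipe layout just mirrors the existing GPTQ and FP8 examples, and the model, dataset, and output names are placeholders.

```python
from llmcompressor.transformers import oneshot

# Sketch only: it is an open question (this issue) whether GPTQ weight
# quantization and an FP8 E4M3 KV cache scheme can be combined in one
# recipe like this. Model and dataset names below are placeholders.
recipe = """
quant_stage:
    quant_modifiers:
        GPTQModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    targets: ["Linear"]
                    weights:
                        num_bits: 4
                        type: int
                        symmetric: true
                        strategy: group
                        group_size: 128
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    dataset="open_platypus",                      # placeholder calibration dataset
    recipe=recipe,
    output_dir="./llama3-gptq-w4a16-fp8-kv",      # placeholder output path
    max_seq_length=2048,
    num_calibration_samples=512,
)
```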

robertgshaw2-neuralmagic commented 2 months ago

@mgoin @horheynm - could you provide an example of this?

horheynm commented 1 month ago

Hi @CharlesRiggins

Thank you for using llm-compressor. We are working on this feature in the current sprint, so you will be able to do this very shortly. Please give us a couple of days and we will get back to you with an example script to try out!

markurtz commented 2 weeks ago

@horheynm do we have an update on the status of this?