neuralmagic / AutoFP8


Can an AutoFP8-quantized MoE model be inferenced with vLLM? (kv_cache fp8, or kv_cache + weights fp8) #23

Open IEI-mjx opened 1 week ago

IEI-mjx commented 1 week ago

I have seen that AutoFP8-quantized models from Hugging Face, especially Mixtral-8x7B-FP8, are supported by vLLM. I am wondering whether models with both the kv_cache and the weights quantized by AutoFP8 are supported by vLLM. Our team is using an MoE-structured model (similar to Mixtral 8x7B), and we are not sure whether an AutoFP8-quantized model with kv_cache_dtype=fp8 is supported or not.

IEI-mjx commented 1 week ago

If it is supported, which vLLM version should I use?

mgoin commented 1 week ago

Hi @IEI-mjx, currently vLLM only supports MixtralForCausalLM models for FP8 W8A8. We are working on a refactor to bring it to more MoE models. You can freely combine FP8 W8A8 with an FP8 KV cache; however, the performance of the FP8 KV cache has not been tuned. This is true as of vLLM 0.5.0.
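For reference, a minimal sketch of loading such a checkpoint in vLLM >= 0.5.0 with the FP8 KV cache enabled. The model path below is just an illustrative FP8 Mixtral checkpoint, and kv_cache_dtype="fp8" is the engine argument that turns on the FP8 KV cache; treat this as a sketch rather than a tuned configuration:

```python
from vllm import LLM, SamplingParams

# Example FP8 W8A8 Mixtral checkpoint (illustrative path, swap in your own model).
llm = LLM(
    model="neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8",
    kv_cache_dtype="fp8",  # enable the FP8 KV cache on top of FP8 W8A8 weights
)

outputs = llm.generate(
    ["Explain what FP8 quantization is in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```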

Here is an example of calibrating static kv_scale for FP8, as well as activation scales: https://github.com/neuralmagic/AutoFP8/blob/main/examples/example_static_kvcache.py

Note that for Mixtral you will need to include ignore_patterns=["re:.*lm_head", "re:.*gate"] so that the gate (router) layers are skipped; see the condensed sketch below.
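A condensed sketch of that calibration flow, adapted for a Mixtral-style MoE model. The class and config argument names follow the linked example as I recall them (AutoFP8ForCausalLM, BaseQuantizeConfig, kv_cache_quant_targets) and may differ slightly from the current code; the calibration prompts are placeholders:

```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_dir = "Mixtral-8x7B-Instruct-v0.1-FP8-KV"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder calibration prompts; a real run should use a few hundred
# representative samples (e.g. from a chat dataset).
calibration_prompts = [
    "Explain the difference between the FP8 E4M3 and E5M2 formats.",
    "Write a short poem about GPUs.",
]
examples = tokenizer(
    calibration_prompts, padding=True, truncation=True, return_tensors="pt"
).to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",  # calibrate static activation scales
    # Skip lm_head and the MoE router gates, as noted above.
    ignore_patterns=["re:.*lm_head", "re:.*gate"],
    # Calibrate static scales for the KV cache projections.
    kv_cache_quant_targets=("k_proj", "v_proj"),
)

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

The saved checkpoint carries the weight, activation, and kv scales, so vLLM can pick them up when serving with kv_cache_dtype="fp8" as shown earlier.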

IEI-mjx commented 5 days ago

Thanks a lot. That really helps!