IEI-mjx opened 1 week ago
If it is supported, which vLLM version should I use?
Hi @IEI-mjx, currently vLLM only supports MixtralForCausalLM models for FP8 W8A8. We are working on a refactor to bring support to more MoE models. You can freely combine FP8 W8A8 with an FP8 KV cache; however, the performance of the FP8 KV cache has not been tuned yet. This is true as of vLLM 0.5.0.
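For reference, combining the two at serve time is just a matter of passing the KV-cache dtype flag. A minimal sketch, assuming an already-quantized FP8 checkpoint (the model ID below is illustrative, substitute your own):

```shell
# Serve an FP8 W8A8 Mixtral checkpoint with an FP8 KV cache (vLLM 0.5.0 flags).
# vLLM detects the weight quantization from the checkpoint's config;
# --kv-cache-dtype fp8 additionally enables the FP8 KV cache.
python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 \
    --kv-cache-dtype fp8
```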
Here is an example of calibrating static kv_scale for FP8, as well as activation scales: https://github.com/neuralmagic/AutoFP8/blob/main/examples/example_static_kvcache.py
Note that for Mixtral you will need to include `ignore_patterns=["re:.*lm_head", "re:.*gate"]` so that the gate (MoE router) layers are left unquantized.
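To illustrate what those patterns match: the `re:` prefix marks the rest of the string as a regular expression over module names. A minimal sketch of that matching logic, using hypothetical module names and an `is_ignored` helper written for illustration (not AutoFP8's actual implementation):

```python
import re

# AutoFP8-style ignore patterns; the "re:" prefix marks a regular expression.
ignore_patterns = ["re:.*lm_head", "re:.*gate"]

def is_ignored(name, patterns):
    """Return True if a module name matches any ignore pattern (illustrative helper)."""
    for pat in patterns:
        if pat.startswith("re:") and re.match(pat[3:], name):
            return True
    return False

# Representative Mixtral module names (illustrative, not exhaustive).
modules = [
    "lm_head",
    "model.layers.0.block_sparse_moe.gate",          # MoE router: keep in high precision
    "model.layers.0.block_sparse_moe.experts.0.w1",  # expert weights: quantize
    "model.layers.0.self_attn.q_proj",               # attention projection: quantize
]

for m in modules:
    print(m, "->", "skip" if is_ignored(m, ignore_patterns) else "quantize")
```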
Thanks a lot. That really helps!
I have seen that AutoFP8-quantized models from Hugging Face, especially Mixtral-8x7B-FP8, are supported by vLLM. I am wondering whether models with both the KV cache and the weights quantized by AutoFP8 are supported by vLLM. Our team is using an MoE-structured model (similar to Mixtral 8x7B), and we are not sure whether an AutoFP8-quantized model served with kv_cache_dtype=fp8 is supported.