Closed: wanzhenchn closed this issue 1 month ago.
I had the same issue on my machine. It seems to be an issue with 0.5.3 (and .post1); this worked on 0.5.2.
In addition, the fp8 KV cache fails with other quantization methods like GPTQ or AWQ.
Thank you for reporting @wanzhenchn @w013nad @QwertyJack. This was unfortunately a simple issue that didn't have test coverage. It will be resolved in the attached PR.
Many thanks for your response.
I noticed that when passing `--quantization fp8 --kv-cache-dtype fp8 --quantization-param-path kv_cache_scales.json`, the k/v scale is set to 1.0. What is the rationale behind this? Why aren't the k/v scales read from the JSON file? Note that `is_hip()` is False on NVIDIA GPUs.
@mgoin
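For context on why the scales appear to be ignored: as far as I can tell, the scale loading is gated on `is_hip()`, so on a CUDA build the branch that reads the JSON is never taken and every layer keeps the default scale of 1.0. A paraphrased sketch of that condition (my reading of the code, not verbatim vLLM source):

```python
# Paraphrased sketch of the guard as I understand it; NOT verbatim vLLM code.
from typing import Optional

from vllm.utils import is_hip  # returns False on NVIDIA/CUDA builds


def kv_scales_would_be_loaded(kv_cache_dtype: str,
                              quantization_param_path: Optional[str]) -> bool:
    # Only when all three conditions hold are the per-layer scaling factors
    # read from kv_cache_scales.json; otherwise every layer keeps kv_scale = 1.0.
    return (kv_cache_dtype == "fp8"
            and is_hip()
            and quantization_param_path is not None)


# On an H100, is_hip() is False, so this prints False and the JSON is ignored.
print(kv_scales_would_be_loaded("fp8", "kv_cache_scales.json"))
```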
> I had the same issue on my machine. It seems to be an issue with 0.5.3 (and .post1); this worked on 0.5.2.
Yeah, I also noticed that.
Your current environment
🐛 Describe the bug
I followed the docs (https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/fp8.rst and https://github.com/vllm-project/vllm/tree/main/examples/fp8) to quantize vicuna-13b-v1.5 to FP8 precision on a single H100, and successfully obtained the FP8 model and the kv_cache_scales.json file using the following commands:
Credit to: https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/fp8.rst
```python
import fire


class AutoFP8:

    def __init__(self,
                 model_path: str,
                 saved_path: str,
                 calib_size: int = 512,
                 activation_scheme: str = "static"):
        self.saved_path = saved_path
        self.calib_size = calib_size
        # Model loading, calibration data preparation and apply_fp8() are
        # omitted here; see the linked docs for the full example.


def main(model_path: str,
         saved_path: str,
         calib_size: int = 512):
    fp8_helper = AutoFP8(model_path, saved_path, calib_size)
    fp8_helper.apply_fp8()


if __name__ == "__main__":
    fire.Fire(main)
```
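For completeness, I ran the script above roughly like this (the script filename and paths are placeholders, not from the docs):

```bash
# Hypothetical invocation of the quantization script above; the filename and
# paths are placeholders. fire maps the CLI flags onto main()'s parameters.
python quantize_fp8.py \
    --model_path /models/vicuna-13b-v1.5 \
    --saved_path /models/vicuna-13b-v1.5-fp8 \
    --calib_size 512
```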
However, when I launched the OpenAI-compatible server with `--quantization fp8`, `--kv-cache-dtype fp8`, and `--quantization-param-path ${output_kv_cache_scales_file}`, the problems shown below occurred. If we only pass `--quantization fp8` when launching the server, everything works well. I think there are some gaps between the nvidia-ammo tool (now called ModelOpt) and the latest vllm, so the docs for the FP8 KV cache need to be updated promptly.
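For reference, the failing launch was roughly of this form (the model and scales-file paths are placeholders):

```bash
# Rough form of the failing launch; model and JSON paths are placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model /models/vicuna-13b-v1.5-fp8 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --quantization-param-path /models/vicuna-13b-v1.5-fp8/kv_cache_scales.json
```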