vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: FP8 Quantization support for AMD GPUs #7471

Open rathnaum opened 1 month ago

rathnaum commented 1 month ago

Your current environment

I am trying out FP8 support on AMD GPUs (MI250, MI300), and vLLM does not yet seem to support FP8 quantization on AMD GPUs. Is there a timeline for when this will be available?

🐛 Describe the bug

Error: "fp8 quantization is currently not supported in ROCm" while running vLLM with quantization=fp8 on AMD GPUs. I am using MI250 AMD GPUs to run the vLLM inference service.
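For reference, a minimal sketch of the kind of invocation that hits this error on ROCm; the model name below is an example, not taken from the report, and the error is raised regardless of which model is used:

```python
# Requesting FP8 quantization through the vLLM Python API.
# On MI250/MI300 (ROCm) this currently fails with:
#   "fp8 quantization is currently not supported in ROCm"
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model, not from the issue
    quantization="fp8",                           # flag that triggers the ROCm error
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```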

ferrybaltimore commented 1 month ago

Here you have approximate instructions for getting it working:

https://github.com/vllm-project/vllm/issues/6576

You can use the master branch; at least it gives me fewer problems than fp8-gemm.

But with an MI300X, the performance of FP8 is more than 3 times slower than half precision.