vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: FP8 Quantization support for AMD GPUs #7471

Open rathnaum opened 3 months ago

rathnaum commented 3 months ago

Your current environment

I am trying out FP8 support on AMD GPUs (MI250, MI300), and the vLLM library does not seem to support FP8 quantization on AMD GPUs yet. Is there a timeline for when this will be available?

🐛 Describe the bug

I get the error "fp8 quantization is currently not supported in ROCm" when running vLLM with quantization=fp8 on AMD GPUs. I am using MI250 GPUs to run the vLLM inference service.
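
For reference, a minimal sketch of the invocation that triggers this (the model name is just a placeholder, not the exact model I am serving):

```python
from vllm import LLM

# On ROCm (MI250/MI300) this raises:
# "fp8 quantization is currently not supported in ROCm"
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    quantization="fp8",
)
```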

ferrybaltimore commented 3 months ago

Here you have approximate instructions for getting it working:

https://github.com/vllm-project/vllm/issues/6576

You can use the master branch; at least it gave me fewer problems than fp8-gemm.

But on an MI300X, FP8 performance is more than 3 times slower than half precision.
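
That number came from a rough comparison along these lines (a sketch only; the model name, prompt set, and generation settings are placeholders, and each configuration should be run in its own process so GPU memory is fully released in between):

```python
import sys
import time

from vllm import LLM, SamplingParams

# Run once with "fp8" and once with no argument (half precision), then
# compare the printed throughput numbers.
quant = sys.argv[1] if len(sys.argv) > 1 else None

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    quantization=quant,
)

prompts = ["Hello, my name is"] * 64
params = SamplingParams(max_tokens=128)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"quantization={quant}: {generated / elapsed:.1f} generated tokens/s")
```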

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!