vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: DeepSeek-Coder-V2-Instruct-FP8 on 8xA100 #7322

Closed: halexan closed this issue 3 months ago

halexan commented 3 months ago

🚀 The feature, motivation and pitch

vLLM has announced support for running Llama 3.1 405B FP8 on 8xA100; see the blog post.

Does vllm support running DeepSeek-Coder-V2-Instruct-FP8 on 8xA100?

However, I notice that vLLM uses a Triton kernel for FusedMoE, which does not support FP8 Marlin mixed precision. See https://github.com/sgl-project/sglang/issues/989#issuecomment-2275698772

Is there any workaround?
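
For context, a minimal sketch of the setup this request is about, assuming the standard vLLM offline Python API (the FP8 checkpoint path below is a placeholder, not a confirmed model name); per the replies below, the Triton FusedMoE path had no FP8 Marlin support at the time, so this would not run on 8xA100:

```python
# Minimal sketch of the intended setup (illustrative only; the FP8 checkpoint
# name is a placeholder, and at the time of this thread the Triton FusedMoE
# path had no FP8 Marlin support, so this would not work on 8xA100).
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/DeepSeek-Coder-V2-Instruct-FP8",  # placeholder FP8 checkpoint
    tensor_parallel_size=8,      # shard the model across the 8 A100s
    trust_remote_code=True,      # DeepSeek-V2 ships custom modeling code
    max_model_len=8192,          # keep the KV cache within A100 memory
)

outputs = llm.generate(
    ["Write a function that merges two sorted lists."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```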

Alternatives

No response

Additional context

No response

robertgshaw2-neuralmagic commented 3 months ago

There is not currently a workaround for this. We have been working on extending Marlin to support FusedMoE and will likely extend it to fp8 at some point, but this will take some time.

See https://github.com/vllm-project/vllm/pull/7079 for progress on the Marlin fused_moe kernel.

robertgshaw2-neuralmagic commented 3 months ago

Closing for now.

jon-chuang commented 3 months ago

Hello @robertgshaw2-neuralmagic, may I ask why an FP8-quantized model would use an FP16×INT4 matmul kernel? Could you point to some resources or a blog post about this? Thank you.

robertgshaw2-neuralmagic commented 3 months ago

Marlin is a mixed-precision inference kernel. It supports int4 weights, int8 weights, and fp8 weights with 16-bit activations (for dense models).

We started by extending Marlin to support fused MoE with int4 and int8 weights and fp16 activations (the PR I linked). A follow-up to this will extend support to fp8 weights as well.
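
For intuition, here is a rough functional sketch of what a W4A16 mixed-precision matmul does (illustration only, not the Marlin kernel itself, which fuses dequantization into a single optimized GEMM and never materializes the 16-bit weight matrix):

```python
# Functional sketch of a W4A16 ("weight-only int4") matmul, illustration only:
# the real Marlin kernel fuses dequantization into a highly optimized GEMM.
import torch

def weight_only_int4_matmul(x, w_int4, scales, group_size=128):
    # x:      [batch, in_features] activations (fp16 on GPU; fp32 here for the demo)
    # w_int4: [in_features, out_features] weights stored as ints in [0, 15]
    # scales: [in_features // group_size, out_features] per-group scales
    w = w_int4.to(x.dtype) - 8.0                          # shift to signed range [-8, 7]
    w = w.view(-1, group_size, w.shape[-1])               # split rows into quantization groups
    w = (w * scales.unsqueeze(1)).reshape(w_int4.shape)   # dequantize on the fly
    return x @ w                                          # ordinary dense GEMM on 16/32-bit values

x = torch.randn(4, 256)                   # small batch: the memory-bound regime Marlin targets
w_q = torch.randint(0, 16, (256, 512))
s = torch.rand(256 // 128, 512)
print(weight_only_int4_matmul(x, w_q, s).shape)  # torch.Size([4, 512])
```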

jon-chuang commented 3 months ago

At what batch size does Marlin become optimal (i.e., hit the roofline) for FP8?

robertgshaw2-neuralmagic commented 3 months ago

I’m not sure I follow the question.

A roofline analysis shows the latency of the kernel as a function of batch size. Marlin GEMM is a highly optimized kernel designed to address performance issues with the prior generation of mixed-precision kernels, which did not perform well in the batch 8-64 range even though the computation there is memory-bound.

So, Marlin follows the roofline plot very well. But you should not expect Marlin to accelerate compute-bound workloads over fp16; for compute-bound workloads we recommend activation quantization.
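
For a rough sense of where that crossover sits, here is a back-of-the-envelope roofline estimate (the peak-FLOPs and bandwidth numbers are assumptions in the ballpark of an A100, not measurements):

```python
# Back-of-the-envelope roofline crossover for a weight-only-quantized GEMM.
# Assumed A100-class numbers; actual values depend on the GPU and the kernel.

PEAK_FLOPS = 312e12   # fp16 tensor-core peak, FLOP/s (assumed)
HBM_BW = 2.0e12       # memory bandwidth, bytes/s (assumed)

def crossover_batch(bytes_per_weight: float) -> float:
    """Batch size M where compute time matches weight-load time.

    time_compute = 2*M*N*K / PEAK_FLOPS
    time_memory  = N*K*bytes_per_weight / HBM_BW   (weights dominate for small M)
    Setting them equal gives M* = PEAK_FLOPS * bytes_per_weight / (2 * HBM_BW).
    """
    return PEAK_FLOPS * bytes_per_weight / (2 * HBM_BW)

for name, b in [("fp16 weights", 2.0), ("fp8 weights", 1.0), ("int4 weights", 0.5)]:
    print(f"{name}: roughly memory-bound below batch ~{crossover_batch(b):.0f}")
# fp16 ~156, fp8 ~78, int4 ~39 under these assumptions
```

Below the crossover the GEMM is limited by loading the weights, so shrinking them (int4/int8/fp8) speeds it up; above it, the 16-bit tensor-core math dominates, which is why activation quantization is the recommendation for compute-bound workloads.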

robertgshaw2-neuralmagic commented 3 months ago

One follow-up: if you're running on Hopper, I don't think it makes sense to use Marlin for fp8, since we can use dynamic activation quantization with high accuracy. The only use of Marlin fp8, IMO, should be for devices that do not support fp8 compute (i.e., A100).
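
To make the distinction concrete, a toy sketch of dynamic per-tensor activation quantization (the W8A8-style path): the activation scale is computed at runtime so both GEMM operands are low precision and can use fp8 tensor cores on Hopper. This is a plain-PyTorch simulation, not vLLM's implementation.

```python
# Toy sketch of dynamic per-tensor activation quantization (W8A8 style).
# In the real path both scaled operands are cast to fp8 and the GEMM runs on
# fp8 tensor cores, which Hopper has and A100 does not. Simulation only.
import torch

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def dynamic_per_tensor_scale(x: torch.Tensor):
    scale = x.abs().max() / FP8_MAX                  # chosen on the fly from this tensor
    x_scaled = (x / scale).clamp(-FP8_MAX, FP8_MAX)  # now fits the fp8 dynamic range
    return x_scaled, scale

x, w = torch.randn(4, 256), torch.randn(256, 512)
x_s, sx = dynamic_per_tensor_scale(x)  # activation scale computed at runtime
w_s, sw = dynamic_per_tensor_scale(w)  # weight scale is normally precomputed offline
y = (x_s @ w_s) * (sx * sw)            # rescale after the matmul (fp8 GEMM in the real kernel)
```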

jon-chuang commented 3 months ago

I see, thank you for the detailed response!