sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0

[Feature] DeepSeek-Coder-V2-Instruct-FP8 on 8xA100 #989

Open halexan opened 1 month ago

halexan commented 1 month ago

Motivation

vLLM has announced support for running llama3.1-405b-fp8 on 8xA100; see their blog post.

Does sglang support running DeepSeek-Coder-V2-Instruct-FP8 on 8xA100?

Related resources

No response

Ying1123 commented 1 month ago

llama-405b-fp8 is supported in sglang; see https://github.com/sgl-project/sglang/blob/228cf47547a3d2f7f38f636f40a5e85b0c3cd646/README.md?plain=1#L199-L200.

DeepSeek-Coder-V2-Instruct-FP8 should be supported as well. Could you try it and let us know if there are any problems?
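
For anyone who wants to try: a minimal launch sketch, assuming the neuralmagic FP8 checkpoint from Hugging Face and sglang's `Runtime` API (which takes the same arguments as `python -m sglang.launch_server`). This is a sketch, not an officially tested recipe.

```python
# Sketch: serve DeepSeek-Coder-V2-Instruct-FP8 on 8x A100 with sglang.
# CLI equivalent: python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code
import sglang as sgl

runtime = sgl.Runtime(
    model_path="neuralmagic/DeepSeek-Coder-V2-Instruct-FP8",
    tp_size=8,               # shard the model across all 8 A100s
    trust_remote_code=True,  # DeepSeek-V2 ships custom modeling code
)
sgl.set_default_backend(runtime)

@sgl.function
def complete(s, prompt):
    s += prompt
    s += sgl.gen("answer", max_tokens=64)

state = complete.run(prompt="def quicksort(arr):")
print(state["answer"])
runtime.shutdown()
```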

Xu-Chen commented 1 month ago

vLLM doesn't support MoE FP8 models on Ampere. This is because vLLM uses Triton for its FusedMoE kernel, and that path has no FP8 Marlin mixed-precision GEMM (Marlin is what lets FP8-weight models run on Ampere, which lacks native FP8 support). See https://huggingface.co/neuralmagic/DeepSeek-Coder-V2-Instruct-FP8/discussions/1

Running DeepSeek-Coder-V2-Lite-Instruct-FP8, I get this error:

  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/root/.local/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 694, in load_weights
    weight_loader(
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 205, in weight_loader
    raise ValueError(
ValueError: input_scales of w1 and w3 of a layer must be equal. But got 0.06986899673938751 vs. 0.09467455744743347
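
The constraint behind this error: with static per-tensor FP8 activation quantization, each expert's gate projection (w1) and up projection (w3) are fused into one GEMM and read the same input tensor, so the checkpoint has to provide a single shared input_scale for both. A rough illustration (my own sketch, not vLLM's actual code):

```python
# Illustration of the constraint behind the ValueError above (not vLLM's code).
# With static per-tensor FP8 activation quantization, w1 (gate_proj) and
# w3 (up_proj) are fused into one GEMM and consume the same input tensor,
# so both must share a single input_scale.
def fused_input_scale(scale_w1: float, scale_w3: float) -> float:
    if scale_w1 != scale_w3:
        raise ValueError(
            "input_scales of w1 and w3 of a layer must be equal. "
            f"But got {scale_w1} vs. {scale_w3}"
        )
    return scale_w1

# A common, conservative workaround is to re-export the checkpoint with both
# scales replaced by their maximum, so neither projection's activations clip:
def merged_input_scale(scale_w1: float, scale_w3: float) -> float:
    return max(scale_w1, scale_w3)
```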
halexan commented 1 month ago

> Running DeepSeek-Coder-V2-Lite-Instruct-FP8, I get this error:
> ValueError: input_scales of w1 and w3 of a layer must be equal. But got 0.06986899673938751 vs. 0.09467455744743347

What is your vllm version?

Xu-Chen commented 1 month ago

> What is your vllm version?

0.5.4

KylinMountain commented 1 month ago

@Xu-Chen So can we use sglang to run DeepSeek-V2 (236B)? Thanks

halexan commented 1 month ago

> @Xu-Chen So can we use sglang to run DeepSeek-V2 (236B)? Thanks

Yes, you can, without quantization.
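
For reference on memory: 236B parameters in BF16 are roughly 472 GB of weights, which fits across 8x A100-80GB (640 GB) and leaves headroom for the KV cache. The same launch sketch as in the earlier comment works, just pointed at the unquantized checkpoint (again a sketch, assuming the official deepseek-ai repo name):

```python
# Sketch: unquantized BF16 DeepSeek-Coder-V2-Instruct on 8x A100-80GB.
# CLI equivalent: python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code
import sglang as sgl

runtime = sgl.Runtime(
    model_path="deepseek-ai/DeepSeek-Coder-V2-Instruct",
    tp_size=8,
    trust_remote_code=True,
)
```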