Open · halexan opened this issue 1 month ago
llama-405b-fp8 is supported in sglang, see https://github.com/sgl-project/sglang/blob/228cf47547a3d2f7f38f636f40a5e85b0c3cd646/README.md?plain=1#L199-L200.
DeepSeek-Coder-V2-Instruct-FP8 should be supported as well. Could you try it and let us know if you run into any problems?
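For reference, here is a rough sketch of how one might launch it with sglang's Python runtime API on 8 GPUs; the Runtime keyword arguments (tp_size, trust_remote_code) are assumptions based on sglang's documented usage and may differ between versions:

import sglang as sgl

# Sketch: load the FP8 checkpoint with tensor parallelism across 8 GPUs.
# The keyword arguments are illustrative, not verified on this model.
runtime = sgl.Runtime(
    model_path="neuralmagic/DeepSeek-Coder-V2-Instruct-FP8",
    tp_size=8,
    trust_remote_code=True,
)
sgl.set_default_backend(runtime)

The command-line equivalent would be python -m sglang.launch_server with matching --model-path and --tp arguments (exact flag names depend on the installed version).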
vLLM doesn't support MoE FP8 models on Ampere, because its FusedMoE kernel is implemented in Triton and doesn't support the FP8 Marlin mixed-precision GEMM. See https://huggingface.co/neuralmagic/DeepSeek-Coder-V2-Instruct-FP8/discussions/1
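For context on the hardware side: A100 (Ampere) is compute capability 8.0 and has no native FP8 tensor-core GEMM, which only arrives with Ada (8.9) and Hopper (9.0), so FP8 checkpoints on Ampere need a mixed-precision path such as Marlin. A quick way to check what the local GPU reports (simple sketch, assuming PyTorch with CUDA):

import torch

# Ampere A100 reports (8, 0); Ada is (8, 9) and Hopper (9, 0).
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor}, "
      f"native FP8 GEMM: {(major, minor) >= (8, 9)}")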
Running DeepSeek-Coder-V2-Lite-Instruct-FP8 fails with the following error (a small checkpoint-inspection sketch follows the traceback):
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
model.load_weights(
File "/root/.local/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 694, in load_weights
weight_loader(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 205, in weight_loader
raise ValueError(
ValueError: input_scales of w1 and w3 of a layer must be equal. But got 0.06986899673938751 vs. 0.09467455744743347
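The error comes from vLLM's FusedMoE weight loader, which fuses w1 and w3 into a single projection and therefore requires one shared input scale, while this checkpoint stores distinct per-projection scales. A small inspection sketch to confirm the mismatch directly from the shards (assumes a safetensors checkpoint; the local path is hypothetical and the on-disk tensor names may use gate_proj/up_proj rather than vLLM's internal w1/w3 names):

import glob
from safetensors import safe_open

# Hypothetical local path to the downloaded FP8 checkpoint shards.
ckpt_dir = "/models/DeepSeek-Coder-V2-Lite-Instruct-FP8"

# Print every per-tensor input scale stored in the checkpoint. In vLLM's MoE
# naming the expert gate/up projections map to w1/w3; the names on disk may
# differ (e.g. gate_proj / up_proj).
for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "input_scale" in name:
                print(name, f.get_tensor(name))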
What is your vLLM version?
0.5.4
@Xu-Chen So can we use sglang to run DeepSeek-V2 236B? Thanks
Yes, you can, without quantization.
Motivation
vLLM has announced support for running llama3.1-405b-fp8 on 8xA100; see their blog post.
Does sglang support running DeepSeek-Coder-V2-Instruct-FP8 on 8xA100?
Related resources
No response