Jiayi-Pan opened this issue 1 month ago
@mgoin could you take a look at this?
I think this is an issue with ReplicatedLinear
Sure, but regardless of this current issue, I believe these are Ampere GPUs, which the FP8 Triton MoE kernel doesn't support.
@mgoin Sorry to bother you. I got the same error when I ran DeepSeek-Coder-V2-Lite-Base-FP8 on two 4090s. My command was:
vllm serve DeepSeek-Coder-V2-Lite-Base-FP8 --gpu-memory-utilization 0.9 --trust-remote-code --max-model-len 10000 --enable-chunked-prefill=False --tensor-parallel-size 2 --enforce_eager
Is it the same issue?
Yes, FP8 for MoE needs compute capability 9.0, and I believe the 4090 is 8.9.
We need to wait for PyTorch to upgrade to Triton 3.0 to support 8.9.
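If you're not sure what your card reports, a quick check with plain PyTorch (just a diagnostic sketch, not a vLLM API) is:

```python
import torch

# Print the CUDA compute capability of each visible GPU.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i} ({torch.cuda.get_device_name(i)}): compute capability {major}.{minor}")
# Hopper (H100) reports 9.0, an RTX 4090 reports 8.9 (Ada),
# and A100/A6000-class Ampere cards report 8.0/8.6.
```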
I have the same problem with an L20 GPU.
Any news on this, considering the new DeepSeek V2.5 release?
This should work with the latest release - have you tried vLLM 0.6.0 and seen the same issue?
The 0.6.0 Docker container gives me this on 8x A6000 (Ampere) with DeepSeek Coder V2:
docker run --name vllm_container --gpus=all -e VLLM_ENGINE_ITERATION_TIMEOUT_S=1200 -p 7861:8000 --ipc=host --shm-size=32gb -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -v /srv/syn/models/deploy/instruct/deepseek-ai_DeepSeek-Coder-V2-Instruct:/srv/syn/models/deploy/instruct/deepseek-ai_DeepSeek-Coder-V2-Instruct vllm/vllm-openai:v0.6.0 --host 0.0.0.0 --served-model-name tgi --tensor-parallel-size 8 --max-num-seqs 16 --model /srv/syn/models/deploy/instruct/deepseek-ai_DeepSeek-Coder-V2-Instruct --max-model-len 8192 --max-num-batched-tokens 8192 --trust-remote-code --enforce-eager --quantization fp8
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 305, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 222, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 125, in _init_executor
    self._run_workers("load_model",
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 182, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 917, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 170, in _initialize_model
    return build_model(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 155, in build_model
    return model_class(config=hf_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 486, in __init__
    self.model = DeepseekV2Model(config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 428, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 247, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 248, in <listcomp>
@freegheist You are loading an fp16 checkpoint and dynamically quantizing it to fp8 after loading. This is running out of memory because you don't have enough memory to hold the whole fp16 checkpoint before the quantization. You need to use an already quantized FP8 checkpoint in order to fit into your system - you should be able to try https://huggingface.co/neuralmagic/DeepSeek-Coder-V2-Instruct-FP8
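To put rough numbers on it (a back-of-envelope sketch, using the published ~236B parameter count for DeepSeek-Coder-V2-Instruct and ignoring activations, KV cache, and CUDA overhead):

```python
# Approximate weight-memory arithmetic for an 8x A6000 (48 GB each) node.
params = 236e9                       # ~236B parameters (published model size)
fp16_weights_gb = params * 2 / 1e9   # ~472 GB at 2 bytes/param
fp8_weights_gb = params * 1 / 1e9    # ~236 GB at 1 byte/param
total_vram_gb = 8 * 48               # 384 GB total VRAM

print(fp16_weights_gb > total_vram_gb)  # True: the fp16 weights alone overflow VRAM
print(fp8_weights_gb < total_vram_gb)   # True: a pre-quantized FP8 checkpoint fits
```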
Thanks for that info... the error happens quickly and didn't seem to OOM on RAM or swap, but that makes sense!
I'm trying the FP8 checkpoint now, which gives the error below:
docker run --name vllm_container --gpus=all -e VLLM_ENGINE_ITERATION_TIMEOUT_S=1200 -p 7861:8000 --ipc=host --shm-size=8gb -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -v /srv/syn/models/deploy/instruct/neuralmagic_DeepSeek-Coder-V2-Instruct-FP8:/srv/syn/models/deploy/instruct/neuralmagic_DeepSeek-Coder-V2-Instruct-FP8 vllm/vllm-openai:v0.6.0 --host 0.0.0.0 --served-model-name tgi --tensor-parallel-size 8 --max-num-seqs 16 --gpu-memory-utilization 0.9999 --model /srv/syn/models/deploy/instruct/neuralmagic_DeepSeek-Coder-V2-Instruct-FP8 --max-model-len 8192 --max-num-batched-tokens 8192 --trust-remote-code --enforce-eager
(VllmWorkerProcess pid=121) INFO 09-10 00:28:35 model_runner.py:926] Loading model weights took 28.2876 GB
ERROR 09-10 00:28:38 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 121 died, exit code: -15
INFO 09-10 00:28:38 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 319, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 448, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 222, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1133, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1450, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
    final_hidden_states = self.experts(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 442, in forward
    final_hidden_states = self.quant_method.apply(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 496, in apply
    return fused_experts(x,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 647, in fused_experts
    moe_align_block_size(curr_topk_ids, config['BLOCK_SIZE_M'], E))
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
    ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
  File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 29, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 538, in moe_align_block_size
    torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
  File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 1061, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 09-10 00:28:45 api_server.py:186] RPCServer process died before responding to readiness probe
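As the RuntimeError message suggests, rerunning with `CUDA_LAUNCH_BLOCKING=1` (for the `docker run` invocation above, that means adding `-e CUDA_LAUNCH_BLOCKING=1`) forces synchronous kernel launches, so the reported stack frame should point at the kernel call that actually failed rather than a later one.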