fengyang95 opened this issue 3 weeks ago
Could you please try the latest vLLM 0.5.1?
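If it helps, picking up that release is just a pip upgrade (a minimal sketch, assuming a standard CUDA-enabled Python environment):

pip install -U vllm==0.5.1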
BTW, why do you want to run deepseek-coder-v2-lite-instruct with 4 x L40? I plan to deploy it on our server with 2 x P40, so I'd like to know your reason.
One L40 is enough; I'm just testing with four cards.
Could you please try the latest vLLM 0.5.1?
Nice! I will try it.
Hi @fengyang95, I found that deepseek-coder-v2-lite-instruct can be started on 2 x L40 GPUs, but the context cannot reach 128K: only 9415 tokens in my test. Did you encounter the same issue? Below is my start command.
python3 -m vllm.entrypoints.openai.api_server --dtype float16 --trust-remote-code --model DeepSeek-Coder-V2-Lite-Instruct --port 9000 --host 0.0.0.0 --tensor-parallel-size 2 --max-seq-len 63040 --max-model-len 30720
When I remove --max-seq-len 63040 --max-model-len 30720, it reports an error on startup:
[rank0]: ValueError: The model's max seq len (163840) is larger than the maximum number of tokens that can be stored in KV cache (63040). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
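As the error message itself suggests, one thing to try (a sketch only; the 0.95 value is my assumption, not something verified in this thread) is to raise --gpu-memory-utilization so more KV cache fits, keeping the other flags the same:

python3 -m vllm.entrypoints.openai.api_server --dtype float16 --trust-remote-code --model DeepSeek-Coder-V2-Lite-Instruct --port 9000 --host 0.0.0.0 --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --max-model-len 30720

That alone likely still won't cover the model's full 163840-token default, so the max length probably also needs lowering, as in the reply below.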
Yes, you need to reduce max_model_len.
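For example, capping the context at the KV-cache budget reported in the error above should let the server start (a sketch; I have not verified this exact value on 2 x L40):

python3 -m vllm.entrypoints.openai.api_server --dtype float16 --trust-remote-code --model DeepSeek-Coder-V2-Lite-Instruct --port 9000 --host 0.0.0.0 --tensor-parallel-size 2 --max-model-len 63040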
Your current environment
🐛 Describe the bug
When starting deepseek-coder-v2-lite-instruct with vLLM on 4 GPUs, one of them sits at 0% utilization. There is no issue when tensor_parallel_size=1.
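For reference, a minimal launch matching this description (flags taken from the commands earlier in the thread; the model path is whatever your local checkout uses) would be:

python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model DeepSeek-Coder-V2-Lite-Instruct --tensor-parallel-size 4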