vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to increase the context length when starting with vllm.entrypoints.openai.api_server #6211


garyyang85 commented 3 months ago

Your current environment

I found that deepseek-coder-v2-lite-instruct can be started on 2 x L40 GPUs with vllm 0.5.1, but the context length cannot reach 128K: I only got 9415 tokens in my test. Below is my start command.

python3 -m vllm.entrypoints.openai.api_server --dtype float16 --trust-remote-code --model DeepSeek-Coder-V2-Lite-Instruct --port 9000 --host 0.0.0.0    --tensor-parallel-size 2 --max-seq-len 63040 --max-model-len 30720

When I remove --max-seq-len 63040 --max-model-len 30720, it reports an error at startup:

[rank0]: ValueError: The model's max seq len (163840) is larger than the maximum number of tokens that can be stored in KV cache (63040). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
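
For reference, a minimal sketch of the same setup through vLLM's offline LLM API, applying the two remedies the error message names; the model path and numbers are simply the ones from this issue (the equivalent server flags are --gpu-memory-utilization and --max-model-len):

# Sketch only: the same knobs as the CLI flags above, via the offline API.
# Values mirror this issue's command; tune them for your own GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="DeepSeek-Coder-V2-Lite-Instruct",
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=2,        # 2 x L40, as above
    gpu_memory_utilization=0.95,   # remedy 1: leave more memory for the KV cache (default 0.9)
    max_model_len=63040,           # remedy 2: cap the context to what the cache can hold
)

print(llm.generate(["def quicksort(arr):"], SamplingParams(max_tokens=64)))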
Semihal commented 3 months ago

deepseek v2 lite has a 32k context. Upd: Sorry, I made a mistake.

simon-mo commented 3 months ago

The error is saying that the amount of memory available will not be able to handle such a large context.
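
To make that concrete: with standard multi-head or grouped-query attention, every cached token costs roughly 2 * num_layers * num_kv_heads * head_dim * bytes_per_element of KV cache, and whatever GPU memory is left after loading the weights, divided by that cost, is the token budget vLLM reports (63040 in the error above). A back-of-the-envelope sketch with placeholder numbers; DeepSeek-V2's MLA attention caches a compressed latent instead of full K/V, so its real per-token cost is lower:

# Illustrative arithmetic only; the layer/head/memory numbers are placeholders,
# not DeepSeek-Coder-V2-Lite-Instruct's actual configuration.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # float16

per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
free_kv_memory_bytes = 16 * 1024**3  # memory left for the cache after weights (placeholder)

print(per_token_bytes)                          # 128 KiB per token in this example
print(free_kv_memory_bytes // per_token_bytes)  # how many tokens fit in the cache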

garyyang85 commented 3 months ago

@simon-mo Thanks for your reply. So my understanding is: if I remove the params --max-seq-len 63040 --max-model-len 30720 and the memory is enough, it will reach the maximum context length that the model supports. The params --max-seq-len and --max-model-len are a trade-off we can configure so that the model still works with vllm at a decreased context length when the memory is not enough, and that number is estimated by vllm. Right?
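
That reading matches the check behind the error above: at startup vLLM profiles how many tokens the leftover memory can cache (63040 here) and refuses any requested max length above it. A sketch of that relationship, using the figures from this issue's error message; this is not vLLM's actual source, just a restatement of the check:

# Restating the startup check from the error above (not vLLM source code).
kv_cache_capacity_tokens = 63040   # computed by vLLM from the memory left after weights
requested_max_model_len = 163840   # the model's native max seq len, used when no flag is given

if requested_max_model_len > kv_cache_capacity_tokens:
    raise ValueError(
        "Try increasing `gpu_memory_utilization` or decreasing `max_model_len`."
    )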