vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

How to increase vllm scheduler prompt limit? #2737

Closed hanswang1 closed 4 weeks ago

hanswang1 commented 6 months ago

Hi,

I am using the FastChat vicuna-7b-v1.5 model with the vLLM worker. When chatting with the back-end, I hit a prompt length limit in scheduler.py.

[screenshot: scheduler prompt-limit warning]

May I know how to increase this prompt length limit in scheduler.py?

LiuXiaoxuanPKU commented 6 months ago

Based on this, vicuna-7b-v1.5 only supports a 4K context length. vLLM checks the max model length (read from the model config) and throws the warning here.
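
For reference, a minimal way to check this yourself; a sketch using transformers, not vLLM's exact code path:

```python
# vLLM derives max_model_len from fields such as max_position_embeddings
# in the model's Hugging Face config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("lmsys/vicuna-7b-v1.5")
print(cfg.max_position_embeddings)  # 4096 -> the 4K limit behind the warning
```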

hanswang1 commented 6 months ago

> Based on this, vicuna-7b-v1.5 only supports a 4K context length. vLLM checks the max model length (read from the model config) and throws the warning here.

I changed vicuna-7b-v1.5 to vicuna-7b-v1.5-16k and changed the arguments as below:

args: ["-m", "fastchat.serve.vllm_worker", "--model-path", "lmsys/vicuna-7b-v1.5-16k", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://svc-fc-controller:21001", "--host", "0.0.0.0", "--port", "21002", "--gpu_memory_utilization", "0.998", "--max-model-len", "8192", "--max-num-batched-tokens", "8192"]

The prompt length check now passes; however, I got another error: "Input prompt (4644 tokens) is too long and exceeds the capacity of block_manager". The log is below:

[screenshot: block_manager error log]

May I know how to solve this problem?

hmellor commented 4 months ago

This happens if `self.num_total_gpu_blocks - num_required_blocks < self.watermark_blocks` is `True`.

https://github.com/vllm-project/vllm/blob/b7782002e1da25de77e0b1890ff8b72dd4df917c/vllm/core/block_manager_v1.py#L258-L260
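
In back-of-the-envelope terms (assumed numbers, not vLLM's actual accounting), the prompt needs roughly ceil(prompt_tokens / block_size) KV-cache blocks, and the request is rejected when the free GPU blocks minus the watermark cannot cover it:

```python
import math

block_size = 16                # vLLM's default --block-size
prompt_tokens = 4644           # from the error message above
num_required_blocks = math.ceil(prompt_tokens / block_size)  # 291

# Assumed value: how many KV-cache blocks were profiled after loading the
# weights; it depends on the GPU, the model, and --gpu_memory_utilization.
num_total_gpu_blocks = 280
watermark_blocks = int(0.01 * num_total_gpu_blocks)  # default watermark is 1%

if num_total_gpu_blocks - num_required_blocks < watermark_blocks:
    print("Input prompt (4644 tokens) is too long and exceeds "
          "the capacity of block_manager")
```

In short, the 4644-token prompt needs more KV-cache blocks than the GPU has left after loading the model weights.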

chris-aeviator commented 1 month ago

This also happens for seemingly no reason. I'm trying https://huggingface.co/HuggingFaceTB/SmolLM-1.7B, which does not have a max_length set to 2048, yet vLLM still doesn't run it:

speed input: 14316.32 toks/s, output: 1735.28 toks/s] WARNING 07-17 09:36:39 scheduler.py:687] Input prompt (2076 tokens) is too long and exceeds limit of 2048

hmellor commented 4 weeks ago

> which does not have a max_length set to 2048

Yes, it does:

https://huggingface.co/HuggingFaceTB/SmolLM-1.7B/blob/72784557ea07bfbabb4cfe3b5e34c49e047708bd/config.json#L14
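
A quick way to verify, plus one possible client-side workaround (not from this thread; the 256-token generation budget and the stand-in prompt are illustrative assumptions):

```python
from transformers import AutoConfig, AutoTokenizer

cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
print(cfg.max_position_embeddings)  # 2048, matching the warning above

# Workaround sketch: truncate prompts client-side so that prompt plus
# generation fit inside the model's 2048-token window.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
max_new_tokens = 256                          # assumed generation budget
budget = cfg.max_position_embeddings - max_new_tokens

long_prompt = " ".join(["word"] * 3000)       # stand-in for an over-long input
ids = tok.encode(long_prompt)
prompt = tok.decode(ids[-budget:])            # keep only the most recent tokens
```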