vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: ValueError: User-specified max_model_len (8192) is greater than the derived max_model_len (sliding_window=4096 or model_max_length=None in model's config.json). #6253

Open mfournioux opened 2 months ago

mfournioux commented 2 months ago

Your current environment

python -m vllm.entrypoints.openai.api_server --model Vigostral-7B-Chat-AWQ --served-model-name Vigostral-7B-Chat-AWQ --max-model-len 8192 --quantization awq --enable-prefix-caching --disable-sliding-window
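
For reference, a minimal sketch of the same configuration through the offline Python API, which should hit the same check (assuming the CLI flags above map one-to-one onto LLM constructor arguments):

```python
from vllm import LLM

# Same settings as the CLI invocation above; expected to raise the same
# ValueError because max_model_len (8192) exceeds the 4096-token sliding window.
llm = LLM(
    model="Vigostral-7B-Chat-AWQ",
    quantization="awq",
    max_model_len=8192,
    enable_prefix_caching=True,
    disable_sliding_window=True,
)
```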

How would you like to use vllm

I want to launch vLLM with Vigostral 7B Chat AWQ with prefix caching enabled. To enable prefix caching, I also have to disable the sliding window at the same time.

This leads to a restriction on the max_model_len setting, which is capped at the default sliding window value, according to this line of code: https://github.com/vllm-project/vllm/blob/5d5b4c5fe524c3b62453bba7ad4434a27c81317a/vllm/config.py#L1392
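
For context, a paraphrased sketch of the derivation around the linked line in vllm/config.py (the function and variable names below are illustrative, not the verbatim source): when sliding-window attention is disabled, the derived limit is clamped to the window size, so any larger user-specified value is rejected.

```python
from typing import Optional


# Illustrative paraphrase of the max_model_len derivation in vllm/config.py;
# names are simplified and do not appear in the source.
def derive_max_model_len(config_max_len: int,
                         sliding_window: Optional[int],
                         disable_sliding_window: bool,
                         user_max_model_len: int) -> int:
    derived = config_max_len  # e.g. max_position_embeddings from config.json
    # With the sliding window disabled (required here to enable prefix
    # caching), the model can only attend within the window, so the derived
    # limit is clamped to the window size (4096 for this model).
    if disable_sliding_window and sliding_window is not None:
        derived = min(derived, sliding_window)
    if user_max_model_len > derived:
        raise ValueError(
            f"User-specified max_model_len ({user_max_model_len}) is greater "
            f"than the derived max_model_len ({derived}).")
    return user_max_model_len
```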

Is it possible to increase max_model_len above the default sliding window value when prefix caching is enabled?

Many thanks for your help

waylonli commented 3 weeks ago

Same issue.