vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: ValueError: User-specified max_model_len (8192) is greater than the derived max_model_len (sliding_window=4096 or model_max_length=None in model's config.json). #6253

Open mfournioux opened 2 months ago

mfournioux commented 2 months ago

Your current environment

python -m vllm.entrypoints.openai.api_server --model Vigostral-7B-Chat-AWQ --served-model-name Vigostral-7B-Chat-AWQ --max-model-len 8192 --quantization awq --enable-prefix-caching --disable-sliding-window
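
For reference, a minimal sketch of the same configuration through the offline Python API, which should hit the same check (assuming the CLI flags above map one-to-one onto LLM constructor arguments):

```python
from vllm import LLM

# Same settings as the CLI invocation above; expected to raise the same
# ValueError because max_model_len (8192) exceeds the 4096-token sliding window.
llm = LLM(
    model="Vigostral-7B-Chat-AWQ",
    quantization="awq",
    max_model_len=8192,
    enable_prefix_caching=True,
    disable_sliding_window=True,
)
```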

How would you like to use vllm

I want to launch vLLM with Vigostral 7B Chat AWQ with prefix caching enabled. To enable prefix caching, I also have to disable the sliding window at the same time.

This leads to a restriction on the max_model_len setting, which is capped at the default sliding window value, according to this line of code: https://github.com/vllm-project/vllm/blob/5d5b4c5fe524c3b62453bba7ad4434a27c81317a/vllm/config.py#L1392
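
For context, a paraphrased sketch of the derivation around the linked line in vllm/config.py (the function and variable names below are illustrative, not the verbatim source): when sliding-window attention is disabled, the derived limit is clamped to the window size, so any larger user-specified value is rejected.

```python
from typing import Optional


# Illustrative paraphrase of the max_model_len derivation in vllm/config.py;
# names are simplified and do not appear in the source.
def derive_max_model_len(config_max_len: int,
                         sliding_window: Optional[int],
                         disable_sliding_window: bool,
                         user_max_model_len: int) -> int:
    derived = config_max_len  # e.g. max_position_embeddings from config.json
    # With the sliding window disabled (required here to enable prefix
    # caching), the model can only attend within the window, so the derived
    # limit is clamped to the window size (4096 for this model).
    if disable_sliding_window and sliding_window is not None:
        derived = min(derived, sliding_window)
    if user_max_model_len > derived:
        raise ValueError(
            f"User-specified max_model_len ({user_max_model_len}) is greater "
            f"than the derived max_model_len ({derived}).")
    return user_max_model_len
```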

Is it possible to increase max_model_len above the default sliding window value when prefix caching is enabled?

Many thanks for your help

waylonli commented 3 weeks ago

Same issue.