vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Best server cmd for mistralai/Mistral-7B-v0.1 #3781

Open sshleifer opened 5 months ago

sshleifer commented 5 months ago
export MODEL=mistralai/Mistral-7B-v0.1
python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size=1 \
    --enable-prefix-caching --max-model-len=4096 --trust-remote-code | tee server_mistral.log &

raises NotImplementedError: Sliding window is not allowed with prefix caching enabled!

Is there a way to turn off sliding window and keep prefix caching?

(More generally is there a list of commands to serve common models efficiently?)

robertgshaw2-neuralmagic commented 5 months ago

I do not believe there is currently a way to disable the sliding window, but I think this is something we should add.

ssmi153 commented 1 month ago

You can disable the sliding window with --disable-sliding-window. For Mistral, as you've done, you'll need to restrict the model to a 4096-token context window for this to work.
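Putting it together, a minimal sketch of the adjusted server command (assuming a vLLM build recent enough to include --disable-sliding-window; otherwise unchanged from the original command above):

export MODEL=mistralai/Mistral-7B-v0.1
python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size=1 \
    --disable-sliding-window \
    --enable-prefix-caching --max-model-len=4096 --trust-remote-code | tee server_mistral.log &

With the sliding window disabled, the max model length has to stay at or below the window size (4096 for Mistral-7B-v0.1), which is why --max-model-len=4096 is kept here.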

@robertgshaw2-neuralmagic considering that prefix caching by definition targets the early portion of the prompt, whereas Mistral's sliding window only kicks in after 4096 tokens, do you think it would be possible to enable a prefix cache that only covers the first 4096 tokens of a prompt, so the two features don't clash? That would be the best of both worlds here.