vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Prefix caching in VLLM #5176

Open Abhinay2323 opened 3 months ago

Abhinay2323 commented 3 months ago

Can anyone help me with these doubts?

1) When I launch the OpenAI-compatible vLLM server with `python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --max-model-len 32768 --gpu-memory-utilization 0.8 --quantization awq --enable-prefix-caching`, prefix caching works when I send a batch of requests, but not across multiple separate requests (a minimal sketch of what I mean is below the questions). I observed the GPU KV cache usage dropping to 0% immediately after a request finishes. Am I missing anything here?

2) Instead of the LRU approach, could an LFU approach be implemented for prefix caching? What would the drawbacks be?

3) I want to configure my KV cache with a list of prefixes at server startup and prevent them from being evicted until the server is stopped. Is this possible?
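
For reference, a minimal sketch of what I mean by separate requests in 1); the port (8000), the prompt text, and the prefix length are just placeholders, the server is the one launched with the command above:

```python
# Minimal sketch of "multiple requests": two independent completions that share
# a long prefix, sent to the OpenAI-compatible server started with the command
# above. The port (8000), prompt text, and prefix length are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

shared_prefix = "You are a helpful assistant. Context: " + "lorem ipsum " * 500

for question in ["What is AWQ quantization?", "What is paged attention?"]:
    resp = client.completions.create(
        model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
        prompt=shared_prefix + question,
        max_tokens=64,
    )
    print(resp.choices[0].text)
```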

Thanks in advance

robertgshaw2-neuralmagic commented 3 months ago

1) The KV cache utilization measurement only counts KV blocks for requests that are actively running, so this is expected.

2) We do not support LFU. We are open to alternative eviction policies, but I think LRU makes more sense.

3) This is not currently possible.
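
On 2), to make the trade-off concrete, here is a toy sketch of the two eviction policies over cached prefix blocks; it is not vLLM's actual evictor, just an illustration of the bookkeeping each policy needs:

```python
# Toy illustration of LRU vs. LFU eviction for cached prefix blocks.
# This is NOT vLLM's evictor; block ids and payloads are made up for the sketch.
from collections import OrderedDict


class LRUBlockCache:
    """Evicts the block that was touched longest ago."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> payload, ordered by recency

    def access(self, block_id, payload=None):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)    # refresh recency
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)      # evict least recently used
        self.blocks[block_id] = payload


class LFUBlockCache:
    """Evicts the block with the fewest accesses (ties broken arbitrarily)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = {}   # block_id -> payload
        self.counts = {}   # block_id -> access count

    def access(self, block_id, payload=None):
        if block_id not in self.blocks:
            if len(self.blocks) >= self.capacity:
                victim = min(self.counts, key=self.counts.get)
                del self.blocks[victim]
                del self.counts[victim]
            self.blocks[block_id] = payload
            self.counts[block_id] = 0
        self.counts[block_id] += 1
```

The usual drawback of LFU here is that a prefix that was hot early on keeps a high count and can pin its blocks long after it stops being reused, whereas LRU ages it out as new prefixes arrive; LFU also needs frequency bookkeeping (and typically some aging scheme) on every access.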

summersonnn commented 3 weeks ago

> prefix caching works when I send a batch of requests, but not across multiple separate requests

Can you tell us how you came to this conclusion? Is it only because the KV cache occupancy rate dropped to 0%? Because the vLLM docs clearly state that prefix caching works for "all future requests".
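
One way to check that does not depend on the occupancy gauge is to time the same long-prefix prompt sent as two separate requests: if prefix caching works across requests, the second prefill should be much faster. A rough sketch (model and endpoint come from the original command; the port, prompt text, and prefix length are assumptions):

```python
# Rough cross-request check: time two *separate* completion requests that share
# a long prefix. With max_tokens=1 the latency is dominated by prefill, so a
# much faster second request suggests the cached prefix was reused.
# The port (8000) and prompt text are assumptions.
import time

import requests

URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    "prompt": ("lorem ipsum " * 500) + "Summarize the text above.",
    "max_tokens": 1,
}

for i in range(2):
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=120).raise_for_status()
    print(f"request {i + 1}: {time.perf_counter() - start:.3f}s")
```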