Open Abhinay2323 opened 6 months ago
1) The KV cache utilization measurement only counts KV blocks held by actively running requests, so this is expected.
2) We do not support LFU. We are open to alternative policies, but I think LRU makes more sense.
3) This is not currently possible.
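For context on answer 2), here is a minimal, hypothetical sketch of an LRU-evicted prefix block cache, only to illustrate the policy being discussed. The class and names are invented; this is not vLLM's actual block manager.

```python
from collections import OrderedDict


class ToyPrefixBlockCache:
    """Toy LRU cache keyed by a hash of a prefix's token block.

    Purely illustrative of the LRU policy mentioned above; NOT vLLM internals.
    """

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()  # block_hash -> cached KV block (placeholder)

    def get(self, block_hash: int):
        if block_hash not in self._blocks:
            return None
        # A hit marks the block as most recently used.
        self._blocks.move_to_end(block_hash)
        return self._blocks[block_hash]

    def put(self, block_hash: int, kv_block) -> None:
        if block_hash in self._blocks:
            self._blocks.move_to_end(block_hash)
        self._blocks[block_hash] = kv_block
        # Evict the least recently used block when over capacity.
        while len(self._blocks) > self.max_blocks:
            self._blocks.popitem(last=False)
```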
Prefix caching is working when I send a batch of requests, but not across multiple requests.
Can you tell us how you came to this conclusion? Is it only because the KV cache occupancy rate dropped to 0? Because the vLLM docs clearly state that prefix caching works for "all future requests".
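One way to check this without relying on the occupancy gauge is the server's Prometheus metrics endpoint. The sketch below assumes the default port 8000 and simply filters any metric mentioning "prefix", since the exact metric names vary across vLLM versions:

```python
# Inspect the OpenAI-compatible server's /metrics endpoint rather than the
# KV cache occupancy gauge (occupancy only reflects blocks held by currently
# running requests). Metric names differ between vLLM versions, so we just
# grep for anything containing "prefix".
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # adjust host/port if needed

with urllib.request.urlopen(METRICS_URL) as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if "prefix" in line.lower() and not line.startswith("#"):
        print(line)
```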
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Can anyone help me with these doubts?
1) When I launch the OpenAI-compatible vLLM server:
```
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --max-model-len 32768 --gpu-memory-utilization 0.8 --quantization awq --enable-prefix-caching
```
prefix caching works when I send a batch of requests, but not across multiple requests. I observed that GPU KV cache usage drops to 0% immediately after a request finishes. Am I missing anything here? (See the test sketch after this list.)
2) Instead of the LRU approach, could an LFU approach be implemented for prefix caching? What would be the drawbacks, if so?
3) I want to configure my KV cache with a list of prefixes at server startup and prevent them from being offloaded until the server is stopped. Is this possible?
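Here is a minimal test sketch for question 1), assuming the server launched above is listening on the default port 8000: send two separate requests whose prompts share a long common prefix and compare their latencies (and the prefix-related metrics). The prefix text and the helper function are made up for illustration; only the standard OpenAI-completions fields are assumed.

```python
# Exercise cross-request prefix caching: two *separate* HTTP requests whose
# prompts share a long common prefix. If cross-request caching is active,
# the second request's prefill should be noticeably cheaper.
import time
import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
SHARED_PREFIX = "You are a helpful assistant. " * 200  # long enough to span several KV blocks


def complete(suffix: str) -> float:
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    r = requests.post(URL, json={
        "model": MODEL,
        "prompt": SHARED_PREFIX + suffix,
        "max_tokens": 16,
        "temperature": 0.0,
    })
    r.raise_for_status()
    return time.perf_counter() - start


print("first request :", complete("Question one?"), "s")
print("second request:", complete("Question two?"), "s")  # should benefit from the cached prefix if cross-request caching works
```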
Thanks in advance