AFAIK, vLLM will reuse the KV cache of a token only if all preceding tokens match. So a shared system prompt such as 'you are a clever assistant' will be reused across requests. Sadly, I am not familiar with --chat-template, so I am not sure about the last question.
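Roughly speaking, prefix caching works at the block level: a cached block of KV cache can only be reused when the entire prefix up to and including that block is identical. Below is a minimal conceptual sketch of that matching rule (not vLLM's actual implementation, just an illustration; 16 is vLLM's default block size and the function names are made up):

```python
BLOCK_SIZE = 16  # vLLM's default block size

def block_hashes(token_ids):
    """Hash each full block together with the whole prefix before it."""
    hashes = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        # The hash covers the entire prefix, so a block only matches a
        # cached block when all preceding tokens are identical as well.
        hashes.append(hash(tuple(token_ids[:end])))
    return hashes

def shared_prefix_blocks(req_a_tokens, req_b_tokens):
    """Count how many leading KV-cache blocks two requests could share."""
    shared = 0
    for ha, hb in zip(block_hashes(req_a_tokens), block_hashes(req_b_tokens)):
        if ha != hb:
            break
        shared += 1
    return shared
```

So two requests that share the same system prompt can share the blocks fully covered by it; as soon as the token streams diverge, no further blocks are shared.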
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
How would you like to use vllm
I want to know how (and whether) vLLM ensures correctness when reusing the KV cache under high concurrency and with differing system prompts. I noticed that a hash-based prefix tree is used for prefix caching. If I launch a server with prefix caching enabled, consider the following three cases:
case 1:
Many users call the service with an identical system prompt but different user prompts (high concurrency, multiple sessions). Will the KV cache be reused across requests? For example:
request 1 from session one: system prompt + user prompt is 'you are a clever assistant' + 'please introduce CNN to me.'
request 2 from session two: system prompt + user prompt is 'you are a clever assistant' + 'what is the most popular interest place in China?'
Will request 2 reuse part of request 1's KV cache, since some of the tokens are identical?
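For concreteness, the two calls against the OpenAI-compatible server would look roughly like this (the base URL and model name are placeholders for my deployment):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
shared_system = "you are a clever assistant"

# request 1 (session one)
r1 = client.chat.completions.create(
    model="my-model",  # placeholder
    messages=[
        {"role": "system", "content": shared_system},
        {"role": "user", "content": "please introduce CNN to me."},
    ],
)

# request 2 (session two): same system prompt, different user prompt
r2 = client.chat.completions.create(
    model="my-model",  # placeholder
    messages=[
        {"role": "system", "content": shared_system},
        {"role": "user", "content": "what is the most popular interest place in China?"},
    ],
)
```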
case 2:
Users call the service with different system prompts and different user prompts (because they are working on different tasks). For example:
request 1 from session one: system prompt + user prompt is 'you are developed by AIRR,' + 'please recommend some short movie to me.'
request 2 from session two: system prompt + user prompt is 'you are a clever code completion assistant' + 'aaaa xxx bbbb' (assume it is a code completion task)
In the further chat of these two sessions, can the KV cache get mismatched (mixed up) between the two sessions?
case 3:
Only one user calls the service, in ONE session, but with a different system prompt and user prompt per request. For example:
request 1 from session one: system prompt + user prompt is 'you are developed by AIRR,' + 'please recommend some short movie to me.'
request 2 from the same session: system prompt + user prompt is 'you are a clever code chat assistant' + 'what is static method in python?'
Will the KV cache of request 1 be erased when request 2 arrives, or is it just left unused and kept/evicted according to the eviction policy?
One more question: is it possible to set more than one system prompt in one service? I noticed that there is a --chat-template argument when launching the service.
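For reference, this is roughly how I launch the service (the model name and template path are placeholders for my setup):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model my-org/my-model \
    --enable-prefix-caching \
    --chat-template ./my_chat_template.jinja
```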