vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: prefix-caching #4670

Open chenchunhui97 opened 6 months ago

chenchunhui97 commented 6 months ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I want to know how (or whether) vLLM ensures correctness when reusing the KV cache under high concurrency and with differing system prompts. I noticed that a hash-based prefix tree is used for prefix caching. Suppose I launch a server with prefix caching enabled and consider the following three cases (see also the launch sketch after the three cases):

case 1:

Several users call the service with an identical system prompt but different user prompts (many users, high concurrency, multiple sessions). Will the KV cache be reused across requests? For example, request 1 from session one has system prompt + user prompt: 'you are a clever assistant' + 'please introduce CNN to me.'
Request 2 from session two has system prompt + user prompt: 'you are a clever assistant' + 'what is the most popular interest place in China?' Will request 2 reuse part of request 1's KV cache, since some of the tokens are identical?

case 2:

Users call the service with different system prompts and user prompts (because they are working on different tasks). Request 1 from session one has system prompt + user prompt: 'you are developed by AIRR,' + 'please recommend some short movie to me.'
Request 2 from session two has system prompt + user prompt: 'you are a clever code completion assistant' + 'aaaa xxx bbbb' (assume it is a code-completion task). As the two sessions continue chatting, can the KV cache get mismatched between them?

case 3:

Only one user calls the service, in ONE session, but with different system prompts and user prompts. Request 1 from session one has system prompt + user prompt: 'you are developed by AIRR,' + 'please recommend some short movie to me.'
Request 2 from the same session has system prompt + user prompt: 'you are a clever code chat assistant' + 'what is static method in python?'
Will the KV cache of request 1 be erased when request 2 arrives, or just left unused and kept until the eviction policy removes it?
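For concreteness, here is a minimal sketch of how case 1 could be exercised with prefix caching enabled, using vLLM's offline `LLM` API (the model name is just a placeholder; the equivalent flag when launching a server should be `--enable-prefix-caching`):

```python
from vllm import LLM, SamplingParams

# Enable automatic prefix caching; the model below is only a placeholder.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

system = "you are a clever assistant "
prompts = [
    system + "please introduce CNN to me.",                        # case 1, request 1
    system + "what is the most popular interest place in China?",  # case 1, request 2
]

# The second prompt shares the system-prompt prefix with the first, so its
# leading KV-cache blocks can be reused if prefix caching behaves as asked above.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```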

One more question: is it possible to set more than one system prompt in a single service? I noticed that there is a --chat-template argument when launching the service.

KuntaiDu commented 6 months ago

AFAIK, vLLM will reuse the KV cache of a token only if all preceding tokens match. So

  1. Yes. The KV cache of the shared system prompt 'you are a clever assistant' will be reused.
  2. No. If the system prompts are not identical, the preceding tokens do not match, and the KV cache of all subsequent tokens will not be reused.
  3. It will not be erased immediately, but it may be evicted sometime in the future.
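To illustrate the "all preceding tokens must match" rule, here is a minimal, simplified sketch of block-level prefix hashing (not vLLM's actual implementation): each block's hash folds in the hash of everything before it, so a single differing token invalidates that block and every later one.

```python
import hashlib

BLOCK_SIZE = 16  # illustrative; vLLM caches KV at block granularity

def block_hashes(token_ids):
    """Hash each full block together with the hash of all preceding blocks,
    so a block's hash changes if ANY earlier token differs."""
    hashes, prev = [], b""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes

# Case 1: same system-prompt prefix, different user prompt -> leading blocks match.
# Case 2: a token differs early on -> that block and every later block miss the cache.
req1 = list(range(40))                  # pretend token ids
req2 = list(range(40)); req2[20] = 999  # diverges inside the second block
shared = sum(h1 == h2 for h1, h2 in zip(block_hashes(req1), block_hashes(req2)))
print(shared)  # -> 1: only the first full block (tokens 0-15) can be reused
```

This also matches point 3 above: cached blocks from an older request simply sit unreferenced until the eviction policy reclaims the space.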

Sadly I am not familiar with --chat-template so I am not sure about the last question.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!