vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Does Prefix Caching currently support offloading to the CPU? #6676

[Open] wjj19950828 opened this issue 1 month ago

wjj19950828 commented 1 month ago

Usage

Does Prefix Caching currently support offloading to the CPU?

If not, is there a plan to support it? Thanks~

simon-mo commented 1 month ago

No, not currently. And yes, there is a plan; @KuntaiDu can add more.

wjj19950828 commented 1 month ago

@KuntaiDu Do you have any suggestions on offloading the prefix cache to the CPU? Thanks~

KuntaiDu commented 1 month ago

Yes, we have put some thought into supporting CPU/disk/database KV cache offloading. I am busy profiling vLLM's performance bottlenecks at the moment (#6794), but I will circle back to KV cache offloading next week.
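To make the idea concrete, here is a rough sketch of what offloading cached prefix blocks to the CPU could look like: a bounded "GPU" tier that, instead of discarding least-recently-used KV blocks, demotes them to a larger "CPU" tier, promoting them back on a hit. This is purely illustrative; the class and method names are hypothetical and this is not vLLM's actual implementation or API.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Illustrative sketch (NOT vLLM's implementation): a prefix-block
    cache where blocks evicted from a bounded "GPU" tier are offloaded
    to a larger "CPU" tier instead of being discarded."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        # OrderedDicts stand in for GPU/CPU block pools, in LRU order.
        self.gpu: "OrderedDict[str, bytes]" = OrderedDict()  # hot tier
        self.cpu: "OrderedDict[str, bytes]" = OrderedDict()  # offload tier

    def put(self, block_hash: str, kv_block: bytes) -> None:
        self.gpu[block_hash] = kv_block
        self.gpu.move_to_end(block_hash)
        while len(self.gpu) > self.gpu_capacity:
            # Evict the least-recently-used GPU block to CPU memory
            # rather than freeing it outright.
            evicted_hash, evicted_block = self.gpu.popitem(last=False)
            self.cpu[evicted_hash] = evicted_block
            while len(self.cpu) > self.cpu_capacity:
                # CPU tier full too: here a real system might spill
                # to disk or a database; we simply drop the block.
                self.cpu.popitem(last=False)

    def get(self, block_hash: str):
        if block_hash in self.gpu:
            self.gpu.move_to_end(block_hash)
            return self.gpu[block_hash]
        if block_hash in self.cpu:
            # CPU-tier hit: promote the block back onto the GPU,
            # avoiding recomputation of the prefix KV.
            kv_block = self.cpu.pop(block_hash)
            self.put(block_hash, kv_block)
            return kv_block
        return None  # full miss: the KV must be recomputed


cache = TwoTierKVCache(gpu_capacity=2, cpu_capacity=4)
cache.put("blk-a", b"kv-a")
cache.put("blk-b", b"kv-b")
cache.put("blk-c", b"kv-c")            # evicts blk-a from GPU to CPU
assert "blk-a" in cache.cpu
assert cache.get("blk-a") == b"kv-a"   # CPU hit, promoted back to GPU
assert "blk-a" in cache.gpu
```

The interesting design questions (which the sketch glosses over) are the transfer cost of moving blocks across the PCIe bus, when promotion is worth it versus recomputing, and how eviction interacts with blocks still referenced by running requests.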