vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Serving inference requests based on caching #5796

Open nani1149 opened 3 months ago

nani1149 commented 3 months ago

Anything you want to discuss about vllm.

I am trying to see whether we can enable/disable a cache that is used for inference requests. For example, if one user asks a question about something, how can that be reused for future similar requests based on the cache? Do I need to enable the --prefix-caching flag?

Second question: where does vLLM save the cache? Is it on physical disk?

Can someone please answer the above questions?

simon-mo commented 3 months ago

You need to turn on the flag --enable-prefix-caching for now. The cache is stored in GPU memory.
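
For reference, a minimal sketch of how the same setting can be turned on through the offline Python API, where `enable_prefix_caching` maps to the `--enable-prefix-caching` CLI flag (the model name, prompts, and sampling settings below are placeholders, not anything prescribed in this thread):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching=True turns on automatic prefix caching,
# equivalent to passing --enable-prefix-caching to the server.
# The model name here is just an example.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

# Requests that share a common prefix can reuse the cached KV blocks
# for that prefix instead of recomputing them.
shared_prefix = "You are a helpful assistant. Answer the user's question.\n"
prompts = [
    shared_prefix + "What is vLLM?",
    shared_prefix + "How does paged attention work?",
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```

Note that the cached KV blocks live in GPU memory managed by the engine, so the cache does not persist to disk and is bounded by the GPU memory reserved for the KV cache.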