Closed pj-ml closed 9 months ago
@pj-ml What you are referring to is the base class, which defines the interface but has no actual implementation. For the SGLang runtime, the method is implemented here: https://github.com/sgl-project/sglang/blob/4ea92f83077ce70381528d7d1fcc565db7698d69/python/sglang/backend/runtime_endpoint.py#L35-L40
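For readers unfamiliar with the pattern, here is a minimal sketch of a base class whose caching hook is a no-op and a subclass that overrides it. The class names `ToyBaseBackend` and `ToyRuntimeBackend` and the in-memory set are hypothetical and are not SGLang's code; the linked file is the place to look for the real implementation.

```python
class ToyBaseBackend:
    """Hypothetical base class: declares the hook but does nothing."""

    def prefix_cache(self, prefix_str: str) -> None:
        # Intentionally a no-op; concrete backends are expected to override it.
        pass


class ToyRuntimeBackend(ToyBaseBackend):
    """Hypothetical concrete backend that records the prefixes it receives."""

    def __init__(self) -> None:
        self.cached = set()

    def prefix_cache(self, prefix_str: str) -> None:
        # A real runtime would forward the prefix to a server;
        # this toy version only remembers it locally.
        self.cached.add(prefix_str)


base = ToyBaseBackend()
base.prefix_cache("You are a helpful assistant.")  # no effect

runtime = ToyRuntimeBackend()
runtime.prefix_cache("You are a helpful assistant.")  # recorded in runtime.cached
```

Seeing an empty method on the base class therefore does not mean the feature is missing; it only means each backend supplies its own behavior.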
You do not need to do anything; just run your prompt once. The runtime will automatically cache it and reuse it for future requests. You can learn more about how it works in this blog post: https://lmsys.org/blog/2024-01-17-sglang/#backend-automatic-kv-cache-reuse-with-radixattention
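As a rough illustration of what "automatically cache it and reuse it" means, here is a toy sketch of prefix reuse using a plain token tree. It is not how SGLang implements RadixAttention (the blog post covers that); `ToyPrefixCache` and its `process` method are hypothetical names, and the "tokens" are just whitespace-split words.

```python
class ToyPrefixCache:
    """Toy token-level prefix tree, for illustration only."""

    def __init__(self) -> None:
        self.root = {}

    def process(self, tokens):
        """Return how many leading tokens were already cached, caching the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.get(tok)
            if child is None:
                # First miss: from here on every node is new, so nothing more is reused.
                child = node[tok] = {}
            else:
                reused += 1
            node = child
        return reused


cache = ToyPrefixCache()
system = "You are a helpful assistant .".split()  # shared 6-token prefix

first = cache.process(system + "What is 2+2 ?".split())   # nothing cached yet
second = cache.process(system + "Name a color .".split())  # shared prefix is reused
print(first, second)  # prints "0 6"
```

The first request pays for the whole prompt; the second reuses the shared prefix without any explicit call from the user, which is the behavior described above.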
Thanks so much for the work on this repo so far.
I think prefix caching could be very useful and I see that vLLM is also starting to support it for some architectures.
It looks like the BaseBackend.prefix_cache method still needs to be implemented: