Implement prefix_cache

sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

Apache License 2.0

5.9k stars 477 forks

@pj-ml What you are referring to is the base class, so it has no actual implementation. For the SGLang runtime, it is implemented here: https://github.com/sgl-project/sglang/blob/4ea92f83077ce70381528d7d1fcc565db7698d69/python/sglang/backend/runtime_endpoint.py#L35-L40

You do not need to do anything special: just run your prompt once. The runtime automatically caches its KV cache and reuses it for future requests that share the same prefix. You can learn more about how it works in this blog post: https://lmsys.org/blog/2024-01-17-sglang/#backend-automatic-kv-cache-reuse-with-radixattention
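To give a rough feel for why running the prompt once is enough, here is a minimal toy sketch of prefix reuse. It is not SGLang's code: `ToyPrefixCache`, `match_prefix`, `insert`, and `run` are hypothetical names made up for this illustration, the token IDs are arbitrary integers, and the structure is a plain per-token trie rather than the compressed radix tree with eviction that the blog post describes. It only counts how many leading tokens of a request were already seen; a real runtime would reuse the stored KV cache for those tokens.

```python
from typing import Dict, List, Tuple


class _Node:
    """One trie node; each edge is a single token ID."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class ToyPrefixCache:
    """Hypothetical illustration only: remembers token prefixes seen so far."""

    def __init__(self) -> None:
        self.root = _Node()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already in the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record the full token sequence so later requests can match it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def run(self, tokens: List[int]) -> Tuple[int, int]:
        """Simulate one request: return (tokens reused, tokens computed)."""
        reused = self.match_prefix(tokens)
        self.insert(tokens)  # caching happens automatically on every request
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]  # stand-in for a shared prompt prefix

first = cache.run(system_prompt + [10, 11])       # nothing cached yet
second = cache.run(system_prompt + [20, 21, 22])  # shared prefix is reused

print(first)   # (0, 6)
print(second)  # (4, 3)
```

The second request only has to compute the three tokens after the shared prefix, which mirrors the behavior described above: the caller does nothing beyond sending requests that start with the same prompt.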
