sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

Implement prefix_cache #106

Closed: pj-ml closed this issue 9 months ago

pj-ml commented 9 months ago

Thanks so much for the work on this repo so far.

I think prefix caching could be very useful, and I see that vLLM is also starting to support it for some architectures.

It looks like the BaseBackend.cache_prefix method still needs to be implemented:

    def cache_prefix(self, prefix_str: str):
        pass

merrymercy commented 9 months ago

@pj-ml, what you are referring to is the base class, which only defines the interface, so its cache_prefix is intentionally empty. For the SGLang runtime, the method is implemented here: https://github.com/sgl-project/sglang/blob/4ea92f83077ce70381528d7d1fcc565db7698d69/python/sglang/backend/runtime_endpoint.py#L35-L40
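To illustrate the pattern, here is a minimal sketch of a no-op hook on a base class and a subclass that overrides it. This is not SGLang's code: `ToyBackend` and `cached_prefixes` are hypothetical names made up for this example, and only the `cache_prefix` signature is taken from the snippet quoted in the issue.

```python
class BaseBackend:
    """Interface only: the default hook is intentionally a no-op."""

    def cache_prefix(self, prefix_str: str):
        pass


class ToyBackend(BaseBackend):
    """Hypothetical backend for illustration (not SGLang's runtime endpoint)."""

    def __init__(self):
        self.cached_prefixes = []

    def cache_prefix(self, prefix_str: str):
        # A real backend would ask its server to process the prefix;
        # this sketch only records that the hook was called.
        self.cached_prefixes.append(prefix_str)


base = BaseBackend()
base.cache_prefix("You are a helpful assistant.")  # does nothing, by design

toy = ToyBackend()
toy.cache_prefix("You are a helpful assistant.")
print(toy.cached_prefixes)
```

Seeing `pass` in the base class therefore says nothing about whether a concrete backend supports prefix caching; the behavior lives in the subclass.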

You do not need to do anything special; just run your prompt once. The runtime will automatically cache the prompt and reuse it for future requests that share the same prefix. You can learn more about how it works in this blog post: https://lmsys.org/blog/2024-01-17-sglang/#backend-automatic-kv-cache-reuse-with-radixattention
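The idea of automatic reuse can be sketched with a toy token-level trie. This is a simplified illustration under stated assumptions, not the RadixAttention implementation: `ToyPrefixCache` is a hypothetical class, tokens are whitespace-split words, no KV tensors are stored, and there is no eviction.

```python
class ToyPrefixCache:
    """Token-level trie that remembers which prefixes were already processed."""

    def __init__(self):
        self.root = {}

    def match_prefix(self, tokens):
        """Return how many leading tokens are already in the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record the full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def run(self, tokens):
        """Simulate one request: return (reused, computed) token counts."""
        reused = self.match_prefix(tokens)
        self.insert(tokens)
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
system = "You are a helpful assistant .".split()

# The first request computes everything; the second reuses the shared prefix.
first = cache.run(system + "What is 2 + 2 ?".split())
second = cache.run(system + "Name a prime number .".split())
print(first)   # (0, 12)
print(second)  # (6, 5)
```

No explicit call to cache the prefix is needed in this sketch: the first request populates the cache as a side effect, which matches the "just run your prompt once" advice.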