turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License
3.2k stars, 235 forks

feature request: Radix Cache #523

Closed by isamu-isozaki 5 days ago

isamu-isozaki commented 5 days ago

Hi! Awesome library. For the new update with the Dynamic Jobs, do you have plans to add a Radix Cache? The idea was proposed in SGLang: save the KV cache of previous prompts and reuse it when a new prompt starts with a previously recorded prompt. This is pretty useful for chain of thought, etc. I do have an implementation with outlines + the previous exllama2 here.

I can try contributing too, although I'm not too familiar with the internals of this library yet. Do you think this could be a feature in exllama2?
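
To make the request concrete, here is a minimal sketch of the prefix-reuse idea in plain Python. It is not SGLang's or exllamav2's implementation: `PrefixCache`, `TrieNode`, and `fake_kv` are hypothetical names, the "KV entries" are placeholder strings rather than tensors, and the structure is an uncompressed token trie (a real radix tree would merge single-child chains).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    # One node per token; `kv` stands in for that token's cached key/value tensors.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    kv: Optional[str] = None


class PrefixCache:
    """Toy token-level trie (hypothetical, not a real library class)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int], kv_entries: List[str]) -> None:
        # Record the KV entry of every token along the prompt's path.
        node = self.root
        for tok, kv in zip(tokens, kv_entries):
            node = node.children.setdefault(tok, TrieNode())
            node.kv = kv

    def match_prefix(self, tokens: List[int]) -> Tuple[int, List[str]]:
        """Return how many leading tokens are cached, plus their KV entries."""
        node = self.root
        reused: List[str] = []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            reused.append(child.kv)
            node = child
        return len(reused), reused


def fake_kv(tokens: List[int], start: int = 0) -> List[str]:
    # Hypothetical stand-in for a forward pass that produces KV entries.
    return [f"kv[{start + i}]={t}" for i, t in enumerate(tokens)]


cache = PrefixCache()
first_prompt = [1, 2, 3, 4]
cache.insert(first_prompt, fake_kv(first_prompt))

second_prompt = [1, 2, 3, 9, 10]
n_cached, reused = cache.match_prefix(second_prompt)
# Only the suffix after the shared prefix still needs a forward pass.
to_compute = second_prompt[n_cached:]
```

Here the second prompt shares its first three tokens with the first one, so only the remaining suffix would need to be run through the model. Because every inserted prompt adds its own path to the trie, several earlier prompts can be matched against, not just the most recent one.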

turboderp commented 5 days ago

This has been a feature of the streaming generator for a long time, for a single prompt. The dynamic generator extends it to dynamically caching multiple prompts as well. See here.

isamu-isozaki commented 5 days ago

@turboderp oh awesome tysm!