Closed dnhkng closed 1 week ago
Well, inference is causal, so any token in the sequence is evaluated in the context of all the previous tokens in the sequence. So two sequences that share the same prefix can share entries in the key/value cache, but sharing a substring isn't enough. So in your first example batch:
[
long_prompt_prefix + abcd + prompt_suffix_1,
long_prompt_prefix + abcd + prompt_suffix_2,
long_prompt_prefix + abcd + prompt_suffix_3,
]
This would evaluate long_prompt_prefix + abcd + prompt_suffix_1 once, and then reuse the keys/values for long_prompt_prefix + abcd when evaluating long_prompt_prefix + abcd + prompt_suffix_2 etc.
If long_prompt_prefix + abcd amounts to more than one page of 256 tokens, the second and third entries in the batch will reference the same VRAM for those pages as well. So, assuming some lengths:
- long_prompt_prefix: 400 tokens
- abcd: 200 tokens
- prompt_suffix_1: 80 tokens
- prompt_suffix_2: 81 tokens
- prompt_suffix_3: 82 tokens

This gives a total cache use of five 256-token pages: two full pages holding the first 512 shared tokens, referenced by all three sequences, plus one unique page per sequence, where nine pages would be needed without sharing.
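The page arithmetic can be sketched as a toy calculation (illustrative only, not ExLlamaV2's actual allocator; the 256-token page size and the lengths come from the discussion above):

```python
import math

PAGE = 256  # cache page size, per the discussion above

def pages(n_tokens: int) -> int:
    """Pages needed to hold n_tokens of keys/values."""
    return math.ceil(n_tokens / PAGE)

shared = 400 + 200                      # long_prompt_prefix + abcd
lengths = [shared + 80, shared + 81, shared + 82]

# Only whole pages lying entirely inside the shared prefix can be
# referenced by all three sequences; the partially filled page diverges.
shared_pages = shared // PAGE                                       # 2
total_pages = shared_pages + sum(pages(n) - shared_pages for n in lengths)
naive_pages = sum(pages(n) for n in lengths)                        # no sharing

print(shared_pages, total_pages, naive_pages)  # 2 5 9
```

So sharing cuts the footprint of this batch from nine pages to five.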
The prefill is automatically optimized as:

- Seq 1: long_prompt_prefix + abcd + prompt_suffix_1 (680 tokens)
- Seq 2: prompt_suffix_2 (81 tokens)
- Seq 3: prompt_suffix_3 (82 tokens)

So 843 tokens processed in total (optimal). Now, assuming you let those generations complete, and run the next batch:
[
long_prompt_prefix + bcde + prompt_suffix_1,
long_prompt_prefix + bcde + prompt_suffix_2,
long_prompt_prefix + bcde + prompt_suffix_3,
]
Assuming bcde is also 200 tokens, the generator will allocate cache pages the same way, but it will only be able to directly reuse long_prompt_prefix from the previous pass. So the prefill becomes:
- Seq 1: bcde + prompt_suffix_1 (280 tokens)
- Seq 2: prompt_suffix_2 (81 tokens)
- Seq 3: prompt_suffix_3 (82 tokens)

So that's 443 new tokens to evaluate, which again is optimal. There's no way to reuse the suffixes, however, since they can only be evaluated in a context where the keys/values for all previous tokens are either in the cache or being added to the cache in the same forward pass. I.e. prompt_suffix_3 following bcde results in different keys and values than prompt_suffix_3 following abcd.
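The prefill accounting above can be reproduced with a toy model (a sketch of the bookkeeping, not ExLlamaV2's implementation): a sequence only has to evaluate the tokens beyond the longest prefix it shares with something already cached.

```python
def common_prefix_len(a, b):
    """Length of the shared token-for-token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def new_tokens(batch, cached):
    """Tokens that must actually be evaluated for a batch, reusing
    keys/values for the longest prefix already present in `cached`."""
    total = 0
    for seq in batch:
        reuse = max((common_prefix_len(seq, c) for c in cached), default=0)
        total += len(seq) - reuse
        cached.append(seq)
    return total

# Lengths from the example above; token values are arbitrary stand-ins.
prefix = ["P"] * 400
abcd, bcde = ["A"] * 200, ["B"] * 200
sufs = [[f"s{i}"] * n for i, n in enumerate([80, 81, 82])]

cached = []
batch1 = [prefix + abcd + s for s in sufs]
batch2 = [prefix + bcde + s for s in sufs]
print(new_tokens(batch1, cached))  # 843
print(new_tokens(batch2, cached))  # 443
```

The second batch only reuses the 400-token long_prompt_prefix from the first, which is why it still costs 443 new tokens even though every suffix has been seen before.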
If you're building a context gradually, like in a chatbot application, it makes sense to do as you suggest. If you already have long_prompt_prefix + abcd + prompt_suffix_1 somewhere in the cache and the next prompt is long_prompt_prefix + abcde + prompt_suffix_1, you can reference/reuse up to long_prompt_prefix + abcd.
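A sketch of that incremental case (names and token ids are illustrative, not ExLlamaV2 API): growing the middle section by one token limits reuse to everything before the insertion point, even though the suffix reappears verbatim afterwards.

```python
def reusable_tokens(cached_ids, new_ids):
    """Leading tokens of new_ids whose keys/values can be taken from an
    existing cache entry built for cached_ids."""
    n = 0
    for a, b in zip(cached_ids, new_ids):
        if a != b:
            break
        n += 1
    return n

# Stand-in token ids: 400-token prefix, 200-token abcd, 80-token suffix.
prefix = list(range(400))
abcd = list(range(1000, 1200))
suffix = list(range(2000, 2080))

cached = prefix + abcd + suffix        # long_prompt_prefix + abcd + prompt_suffix_1
new = prefix + abcd + [1200] + suffix  # abcd grows to abcde by one token
print(reusable_tokens(cached, new))    # 600: everything up to and including abcd
```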
There's a little more information here if you hadn't found that already.
@turboderp Thanks for the detailed reply! I will be moving my GLaDOS project over to ExLlamaV2, once I get my idea implemented.
Currently, it is, I think, the first full voice chatbot that you can 'interrupt' while she is speaking, and she responds accordingly.
https://github.com/dnhkng/GlaDOS
But I want the chatbot to be able to proactively interrupt the user. That's a much more challenging problem, but your dynamic cache system might be what I need to let the chatbot think fast enough to get this done.
I went over the examples, and it seems Automagical! Really cool!
I have a use case that would benefit from this, but it's a bit different. There's a long pre-prompt, then a section of rotating prompt, and then a set of prompt suffixes.
A batch looks like this:
[
long_prompt_prefix + abcd + prompt_suffix_1,
long_prompt_prefix + abcd + prompt_suffix_2,
long_prompt_prefix + abcd + prompt_suffix_3,
]
The next batch would be:
[
long_prompt_prefix + bcde + prompt_suffix_1,
long_prompt_prefix + bcde + prompt_suffix_2,
long_prompt_prefix + bcde + prompt_suffix_3,
]
Does the current caching system handle this? I'm guessing it's a no, but I wanted to clarify!
My other option would then be to extend the middle sections, and periodically clean up: i.e.