turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Quick question on Dynamic Generation #507

Closed dnhkng closed 1 week ago

dnhkng commented 1 week ago

I went over the examples, and it seems Automagical! Really cool!

I have a use case that would benefit from this, but it's a bit different: there's a long pre-prompt, then a rotating middle section, and then a set of prompt suffixes.

A batch looks like this:

[
    long_prompt_prefix + abcd + prompt_suffix_1,
    long_prompt_prefix + abcd + prompt_suffix_2,
    long_prompt_prefix + abcd + prompt_suffix_3,
]

The next batch would be:

[
    long_prompt_prefix + bcde + prompt_suffix_1,
    long_prompt_prefix + bcde + prompt_suffix_2,
    long_prompt_prefix + bcde + prompt_suffix_3,
]

Does the current caching system handle this? I'm guessing it's a no, but I wanted to clarify!
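
For concreteness, this is roughly how I'm planning to run each batch, going by the dynamic generator examples in the repo (the model path, max_seq_len and max_new_tokens are placeholders, and I may have some argument names slightly off):

# Rough sketch of how I plan to run each batch, based on the dynamic generator
# examples. Model path, max_seq_len and max_new_tokens are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

long_prompt_prefix = "..."      # the long, fixed pre-prompt
middle = "abcd"                 # rotating middle section
suffixes = ["prompt_suffix_1", "prompt_suffix_2", "prompt_suffix_3"]

prompts = [long_prompt_prefix + middle + s for s in suffixes]
outputs = generator.generate(prompt=prompts, max_new_tokens=200)

(As far as I understand, the paged attention and cache dedup path needs flash-attn installed; without it the generator falls back to a non-paged mode.)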

My other option would then be to extend the middle section and periodically clean up, i.e.:

# batch 2
[
    long_prompt_prefix + abcde + prompt_suffix_1,
    long_prompt_prefix + abcde + prompt_suffix_2,
    long_prompt_prefix + abcde + prompt_suffix_3,
]

# batch 3
[
    long_prompt_prefix + abcdef + prompt_suffix_1,
    long_prompt_prefix + abcdef + prompt_suffix_2,
    long_prompt_prefix + abcdef + prompt_suffix_3,
]
...  # getting full!
# cleanup
[
    long_prompt_prefix + qrst + prompt_suffix_1,
    long_prompt_prefix + qrst + prompt_suffix_2,
    long_prompt_prefix + qrst + prompt_suffix_3,
]
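
Continuing from the sketch above, the extend-and-clean-up version would look something like this (new_chunk, num_rounds, token_budget and keep_chars are made-up placeholders for this illustration):

# Extend the middle section each round and reset it when the combined prompt
# gets too long. new_chunk(), num_rounds, token_budget and keep_chars are
# placeholders.
middle = "abcd"
for _ in range(num_rounds):
    prompts = [long_prompt_prefix + middle + s for s in suffixes]
    outputs = generator.generate(prompt=prompts, max_new_tokens=200)

    middle += new_chunk()                           # append "e", then "f", ...
    prompt_tokens = tokenizer.encode(long_prompt_prefix + middle).shape[-1]
    if prompt_tokens > token_budget:                # getting full
        middle = middle[-keep_chars:]               # cleanup: keep the recent part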
turboderp commented 1 week ago

Well, inference is causal: any token is evaluated in the context of all the previous tokens in the sequence. Two sequences that share the same prefix can therefore share entries in the key/value cache, but sharing a substring somewhere in the middle isn't enough. So in your first example batch:

[
    long_prompt_prefix + abcd + prompt_suffix_1,
    long_prompt_prefix + abcd + prompt_suffix_2,
    long_prompt_prefix + abcd + prompt_suffix_3,
]

This would evaluate long_prompt_prefix + abcd + prompt_suffix_1 once, and then reuse the keys/values for long_prompt_prefix + abcd when evaluating long_prompt_prefix + abcd + prompt_suffix_2 etc.

If long_prompt_prefix + abcd amounts to more than one page of 256 tokens, the second and third entries in the batch will reference the same VRAM for those pages as well. So, assuming some lengths: say long_prompt_prefix is 400 tokens, abcd is 200 tokens and each prompt_suffix is 81 tokens.

This gives a total cache use of 843 tokens' worth of keys/values (rounded up to whole pages, plus room for whatever is generated), since the shared long_prompt_prefix + abcd is only stored once.

The prefill is automatically optimized as:

Seq 1: long_prompt_prefix + abcd + prompt_suffix_1, i.e. 400 + 200 + 81 = 681 tokens

Seq 2: prompt_suffix_2 only, i.e. 81 tokens (the shared keys/values are reused)

Seq 3: prompt_suffix_3 only, i.e. 81 tokens

So 843 tokens processed in total (optimal). Now, assuming you let those generations complete, and run the next batch:

[
    long_prompt_prefix + bcde + prompt_suffix_1,
    long_prompt_prefix + bcde + prompt_suffix_2,
    long_prompt_prefix + bcde + prompt_suffix_3,
]

Assuming bcde is also 200 tokens, the generator will allocate cache pages the same way, but it will only be able to directly reuse long_prompt_prefix from the previous pass. So the prefill becomes:

Seq 1: bcde + prompt_suffix_1, i.e. 200 + 81 = 281 tokens

Seq 2: prompt_suffix_2 only, i.e. 81 tokens

Seq 3: prompt_suffix_3 only, i.e. 81 tokens

So that's 443 new tokens to evaluate, which again is optimal. There's no way to reuse the suffixes, however, since they can only be evaluated in a context where the keys/values for all previous tokens are either already in the cache or being added to the cache in the same forward pass. I.e. prompt_suffix_3 following bcde results in different keys and values than prompt_suffix_3 following abcd.
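
Those counts can be sanity-checked with a toy model of the reuse: each sequence only has to process the tokens past the longest prefix it shares with something already cached. This ignores page granularity and is only meant to reproduce the arithmetic above, not to describe how the generator is implemented:

# Toy model of prefix reuse during prefill: a sequence only processes the tokens
# past the longest prefix it shares with anything already cached. Ignores page
# granularity; only meant to reproduce the counts above.
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefill_tokens(batch, cached):
    total = 0
    for seq in batch:
        reuse = max((common_prefix_len(seq, c) for c in cached), default=0)
        total += len(seq) - reuse
        cached.append(seq)
    return total

# Stand-in token sequences with the lengths assumed above.
prefix   = ["p"] * 400
abcd     = ["a"] * 200
bcde     = ["b"] * 200
suffixes = [[f"s{i}"] * 81 for i in range(3)]

cached = []
batch1 = [prefix + abcd + s for s in suffixes]
batch2 = [prefix + bcde + s for s in suffixes]
print(prefill_tokens(batch1, cached))   # 843
print(prefill_tokens(batch2, cached))   # 443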

If you're building a context gradually, like in a chatbot application, it makes sense to do as you suggest. If you already have long_prompt_prefix + abcd + prompt_suffix_1 somewhere in the cache and the next prompt is long_prompt_prefix + abcde + prompt_suffix_1, you can reference/reuse up to long_prompt_prefix + abcd.

There's a little more information here if you hadn't found that already.
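
One more back-of-the-envelope note on the VRAM side, using the lengths assumed above (this is just rough page accounting, not a description of the allocator internals):

# Rough page accounting for the first batch: prefix 400, middle 200, suffix 81,
# 256-token pages, and a page is only shared when it is completely identical.
PAGE = 256
prefix, middle, suffix = 400, 200, 81

shared = prefix + middle                     # 600 tokens common to all three sequences
shared_full_pages = shared // PAGE           # 2 full pages, referenced by every sequence

# Each sequence still needs its own pages for the tail: the 88 leftover shared
# tokens plus its own 81-token suffix land in pages that differ between sequences.
tail_tokens = shared % PAGE + suffix         # 169
tail_pages = -(-tail_tokens // PAGE)         # ceil -> 1 page per sequence

with_sharing = shared_full_pages + 3 * tail_pages      # 5 pages
without_sharing = 3 * -(-(shared + suffix) // PAGE)    # 9 pages
print(with_sharing, without_sharing)                   # 5 9

So with sharing, the first batch fits in 5 pages instead of 9, before counting the pages that fill up with generated tokens.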

dnhkng commented 1 week ago

@turboderp Thanks for the detailed reply! I will be moving my GLaDOS project over to ExllamaV2 once I get my idea implemented.

Currently it is, I think, the first full voice chatbot that you can 'interrupt' while she is speaking, and she responds accordingly.

https://github.com/dnhkng/GlaDOS

But I want the chatbot to be able to proactively interrupt the user. That's a much more challenging problem, but your dynamic cache system might be what I need to let the chatbot think fast enough to get this done.