turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Quick question on Dynamic Generation #507

Closed dnhkng closed 1 week ago

dnhkng commented 1 week ago

I went over the examples, and it seems Automagical! Really cool!

I have a use case that would benefit from this, but it's a bit different: there's a long pre-prompt, then a rotating middle section, and then a set of prompt suffixes.

A batch looks like this:

[
    long_prompt_prefix + abcd + prompt_suffix_1,
    long_prompt_prefix + abcd + prompt_suffix_2,
    long_prompt_prefix + abcd + prompt_suffix_3,
]

The next batch would be:

[
    long_prompt_prefix + bcde + prompt_suffix_1,
    long_prompt_prefix + bcde + prompt_suffix_2,
    long_prompt_prefix + bcde + prompt_suffix_3,
]

Does the current caching system handle this? I'm guessing it's a no, but I wanted to clarify!
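
For concreteness, this is roughly how I'm planning to run each batch, going by the dynamic generator examples in the repo (the model path, max_seq_len and max_new_tokens are placeholders, and I may have some argument names slightly off):

# Rough sketch of how I plan to run each batch, based on the dynamic generator
# examples. Model path, max_seq_len and max_new_tokens are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

long_prompt_prefix = "..."      # the long, fixed pre-prompt
middle = "abcd"                 # rotating middle section
suffixes = ["prompt_suffix_1", "prompt_suffix_2", "prompt_suffix_3"]

prompts = [long_prompt_prefix + middle + s for s in suffixes]
outputs = generator.generate(prompt=prompts, max_new_tokens=200)

(As far as I understand, the paged attention and cache dedup path needs flash-attn installed; without it the generator falls back to a non-paged mode.)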

My other option would then be to extend the middle section and periodically clean up, i.e.:

# batch 2
[
    long_prompt_prefix + abcde + prompt_suffix_1,
    long_prompt_prefix + abcde + prompt_suffix_2,
    long_prompt_prefix + abcde + prompt_suffix_3,
]

# batch 3
[
    long_prompt_prefix + abcdef + prompt_suffix_1,
    long_prompt_prefix + abcdef + prompt_suffix_2,
    long_prompt_prefix + abcdef + prompt_suffix_3,
]
...  # getting full!
# cleanup
[
    long_prompt_prefix + qrst + prompt_suffix_1,
    long_prompt_prefix + qrst + prompt_suffix_2,
    long_prompt_prefix + qrst + prompt_suffix_3,
]
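
Continuing from the sketch above, the extend-and-clean-up version would look something like this (new_chunk, num_rounds, token_budget and keep_chars are made-up placeholders for this illustration):

# Extend the middle section each round and reset it when the combined prompt
# gets too long. new_chunk(), num_rounds, token_budget and keep_chars are
# placeholders.
middle = "abcd"
for _ in range(num_rounds):
    prompts = [long_prompt_prefix + middle + s for s in suffixes]
    outputs = generator.generate(prompt=prompts, max_new_tokens=200)

    middle += new_chunk()                           # append "e", then "f", ...
    prompt_tokens = tokenizer.encode(long_prompt_prefix + middle).shape[-1]
    if prompt_tokens > token_budget:                # getting full
        middle = middle[-keep_chars:]               # cleanup: keep the recent part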
turboderp commented 1 week ago

Well, inference is causal: any token is evaluated in the context of all the previous tokens in the sequence. Two sequences that share the same prefix can therefore share entries in the key/value cache, but sharing a substring somewhere in the middle isn't enough. So in your first example batch:

[
    long_prompt_prefix + abcd + prompt_suffix_1,
    long_prompt_prefix + abcd + prompt_suffix_2,
    long_prompt_prefix + abcd + prompt_suffix_3,
]

This would evaluate long_prompt_prefix + abcd + prompt_suffix_1 once, and then reuse the keys/values for long_prompt_prefix + abcd when evaluating long_prompt_prefix + abcd + prompt_suffix_2 etc.

If long_prompt_prefix + abcd amounts to more than one page of 256 tokens, the second and third entries in the batch will reference the same VRAM for those pages as well. So, assuming some lengths: say long_prompt_prefix is 400 tokens, abcd is 200 tokens and each prompt_suffix is 81 tokens.

This gives a total cache use of 843 tokens' worth of keys/values (rounded up to whole pages, plus room for whatever is generated), since the shared long_prompt_prefix + abcd is only stored once.

The prefill is automatically optimized as:

Seq 1: long_prompt_prefix + abcd + prompt_suffix_1, i.e. 400 + 200 + 81 = 681 tokens

Seq 2: prompt_suffix_2 only, i.e. 81 tokens (the shared keys/values are reused)

Seq 3: prompt_suffix_3 only, i.e. 81 tokens

So 843 tokens processed in total (optimal). Now, assuming you let those generations complete, and run the next batch:

[
    long_prompt_prefix + bcde + prompt_suffix_1,
    long_prompt_prefix + bcde + prompt_suffix_2,
    long_prompt_prefix + bcde + prompt_suffix_3,
]

Assuming bcde is also 200 tokens, the generator will allocate cache pages the same way, but it will only be able to directly reuse long_prompt_prefix from the previous pass. So the prefill becomes:

Seq 1: bcde + prompt_suffix_1, i.e. 200 + 81 = 281 tokens

Seq 2: prompt_suffix_2 only, i.e. 81 tokens

Seq 3: prompt_suffix_3 only, i.e. 81 tokens

So that's 443 new tokens to evaluate, which again is optimal. There's no way to reuse the suffixes, however, since they can only be evaluated in a context where the keys/values for all previous tokens are either already in the cache or being added to the cache in the same forward pass. I.e. prompt_suffix_3 following bcde results in different keys and values than prompt_suffix_3 following abcd.
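
Those counts can be sanity-checked with a toy model of the reuse: each sequence only has to process the tokens past the longest prefix it shares with something already cached. This ignores page granularity and is only meant to reproduce the arithmetic above, not to describe how the generator is implemented:

# Toy model of prefix reuse during prefill: a sequence only processes the tokens
# past the longest prefix it shares with anything already cached. Ignores page
# granularity; only meant to reproduce the counts above.
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefill_tokens(batch, cached):
    total = 0
    for seq in batch:
        reuse = max((common_prefix_len(seq, c) for c in cached), default=0)
        total += len(seq) - reuse
        cached.append(seq)
    return total

# Stand-in token sequences with the lengths assumed above.
prefix   = ["p"] * 400
abcd     = ["a"] * 200
bcde     = ["b"] * 200
suffixes = [[f"s{i}"] * 81 for i in range(3)]

cached = []
batch1 = [prefix + abcd + s for s in suffixes]
batch2 = [prefix + bcde + s for s in suffixes]
print(prefill_tokens(batch1, cached))   # 843
print(prefill_tokens(batch2, cached))   # 443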

If you're building a context gradually, like in a chatbot application, it makes sense to do as you suggest. If you already have long_prompt_prefix + abcd + prompt_suffix_1 somewhere in the cache and the next prompt is long_prompt_prefix + abcde + prompt_suffix_1, you can reference/reuse up to long_prompt_prefix + abcd.

There's a little more information here if you hadn't found that already.
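
One more back-of-the-envelope note on the VRAM side, using the lengths assumed above (this is just rough page accounting, not a description of the allocator internals):

# Rough page accounting for the first batch: prefix 400, middle 200, suffix 81,
# 256-token pages, and a page is only shared when it is completely identical.
PAGE = 256
prefix, middle, suffix = 400, 200, 81

shared = prefix + middle                     # 600 tokens common to all three sequences
shared_full_pages = shared // PAGE           # 2 full pages, referenced by every sequence

# Each sequence still needs its own pages for the tail: the 88 leftover shared
# tokens plus its own 81-token suffix land in pages that differ between sequences.
tail_tokens = shared % PAGE + suffix         # 169
tail_pages = -(-tail_tokens // PAGE)         # ceil -> 1 page per sequence

with_sharing = shared_full_pages + 3 * tail_pages      # 5 pages
without_sharing = 3 * -(-(shared + suffix) // PAGE)    # 9 pages
print(with_sharing, without_sharing)                   # 5 9

So with sharing, the first batch fits in 5 pages instead of 9, before counting the pages that fill up with generated tokens.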

dnhkng commented 1 week ago

@turboderp Thanks for the detailed reply! I will be moving my GLaDOS project over to ExllamaV2 once I get my idea implemented.

Currently it is, I think, the first full voice chatbot that you can 'interrupt' while she is speaking, and she responds accordingly.

https://github.com/dnhkng/GlaDOS

But I want the chatbot to be able to proactively interrupt the user. That's a much more challenging problem, but your dynamic cache system might be what I need to let the chatbot think fast enough to get this done.