The dynamic generator automatically manages the cache and reuses the results of previous jobs, for as many input tokens as it can match between an old and a new job. So if you're building a context bit by bit, all you have to do is make sure you're not editing the past. For instance:
ABCD -> generator ingests ABCD and returns EFG. Cache contains ABCDEFG.

ABCDEFGHIJK -> generator reuses ABCDEFG from the previous job, ingests HIJK and returns LMNOP. Cache now contains ABCDEFGHIJKLMNOP.

ABCDEFGHI123 -> generator reuses ABCDEFGHI from the cache, ingests 123 and returns 456. Cache now has an implicit tree structure containing ABCDEFGHI branching into either JKLMNOP or 123456.

At this point the cache would be able to resume generation from any portion of ABCDEFGHIJKLMNOP (starting with A) or of ABCDEFGHI123456. You could also launch multiple jobs at once starting with ABCDE and they would all reference the same portion of the cache (deduplication).
So basically, it's all automated. If you're building a context bit by bit, just start each new generation with the entire context-so-far and the generator will only process the bits that don't line up with what's already been processed. When the cache fills up, the oldest pages (least recently referenced) are evicted first.
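Untested sketch of what that looks like with the dynamic generator (double-check the exact arguments against the examples in the repo):

```python
# Rough sketch, not a complete script: build the context bit by bit and let the
# generator match the shared prefix against the cache on every call.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 8192, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model = model, cache = cache, tokenizer = tokenizer)

context = "ABCD"                           # initial prompt
for predefined in ["HIJK", "123"]:         # bits you want to splice in yourself
    # Send the entire context-so-far. Only the suffix that isn't already in the
    # cache gets ingested; the matching prefix is reused automatically.
    output = generator.generate(prompt = context, max_new_tokens = 32, add_bos = True)
    # Assumes generate() returns prompt + completion (the default), so just
    # append the next predefined piece and go again.
    context = output + predefined
```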
There's a bit more info here.
That is super helpful, thanks!
Problem
I am very new to exllamav2, so my apologies if this feature can already be achieved through other means, but the goal is to produce a sequence of generations that progressively fill in a template. I supply an initial prompt x0, generate a chunk of tokens x1, add that to the prompt, then add x2 more predefined tokens, supply x0 + x1 + x2 to the model to generate the next chunk, and repeat. This whole process is repeated for a number of examples.
The question is how to ensure that I can properly utilize the kv-cache during this process. In Huggingface, it is straightforward to achieve this with the following loop:
This works very well, since the cache is supplied directly to generate() and then retrieved immediately from output_dict.
I'm not sure how I might achieve something similar in exllama. I am currently trying to set it up in the following way:
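Something along these lines (simplified sketch; model path, sampler settings and the actual template logic are omitted, and `examples`, `template_pieces`, `num_x1_tokens` are placeholders again):

```python
# Simplified sketch of my current attempt, not the full script.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

for x0 in examples:                 # one initial prompt per example
    context = x0
    for x2 in template_pieces:
        # generate_simple() returns prompt + completion, so the context keeps growing
        context = generator.generate_simple(context, settings, num_x1_tokens) + x2
    # ...then move on to the next example
```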
However, I have no idea if it is possible to ensure the cache is doing what I want it to, or if there is a straightforward way to update the ExLlamaV2Cache object with the results of each generation, and then reset it between examples.
Solution
A simple method to ensure the cache can be passed to and from a generation process.
Alternatives
No response
Explanation
Makes using the cache much easier for large numbers of inferences with specific formats.
Examples
No response
Additional context
No response
Acknowledgements