turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Question on repeating a prompt #325

Closed · dnhkng closed 8 months ago

dnhkng commented 8 months ago

I have a use case where I need to run inference many times with a long prompt prefix (think multi-shot prompting).

I have two questions:

1) When you process the initial prompt, it's done as: model.forward(ids[:, :-1], cache, preprocess_only = True). Why don't we include the last token (the :-1 slice)?

2) Can we do a second round of this? I.e. do a forward pass on the prompt prefix using preprocess_only = True, then record the size of the cache. Finally, process the remaining prompt and do some text generation. When the inference is done, reset the cache back to the size of the prompt prefix, like cache.current_seq_len = prefix_cache_size.

The goal is to reduce processing time and to keep subsequent inferences from being affected by previous ones.

turboderp commented 8 months ago

Yes, that's exactly what you'd do. The generator also has this built in. It automatically scans the input IDs and skips inference up to the first position that's different from the cache.
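
For reference, a minimal sketch of that pattern, assuming model, tokenizer and cache (ExLlamaV2, ExLlamaV2Tokenizer, ExLlamaV2Cache) are already initialized; prefix, suffixes and max_new_tokens are illustrative names. It also shows why the prefill can exclude the final prompt token: the pass that includes it is the one that returns the logits needed to sample the first new token.

```python
import torch

max_new_tokens = 64  # illustrative

prefix_ids = tokenizer.encode(prefix)

# Prefill the shared prefix once. preprocess_only fills the cache
# without computing logits.
model.forward(prefix_ids, cache, preprocess_only = True)
prefix_cache_len = cache.current_seq_len  # remember where the prefix ends

for suffix in suffixes:
    suffix_ids = tokenizer.encode(suffix)

    # Process the rest of the prompt. This pass returns logits, so the
    # last position gives the distribution for the first new token.
    logits = model.forward(suffix_ids, cache)
    token = torch.argmax(logits[:, -1, :], dim = -1, keepdim = True)

    for _ in range(max_new_tokens):  # greedy decoding, one token at a time
        logits = model.forward(token, cache)
        token = torch.argmax(logits[:, -1, :], dim = -1, keepdim = True)

    # Rewind the cache to the end of the prefix so the next suffix is
    # unaffected by anything generated here.
    cache.current_seq_len = prefix_cache_len
```

The rewind is just the assignment to cache.current_seq_len; nothing about the prefix is recomputed on the next pass.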

dnhkng commented 8 months ago

@turboderp Thanks, I figured it out myself; I can't use the generator, as I'm accessing the raw logits for a rating system. I've built a system for rating stories (which delayed my Franken-model evaluation system a bit), but I had to overcome this issue first: https://twitter.com/aparnadhinak/status/1748368364395721128

Now I use a very explicit prompt with descriptions of the rating system and examples, and I collect the probabilities of the rating tokens as weights. I get nice output like this, where the score is about 3.8 on a 0-9 rating system. (I don't use 0-10, as the '1' token could stand for either 1 or 10, and I would need to do more work to recover the real probabilities; assigning letters also seems to work and gives more range in the rating system.)
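
A hedged sketch of that weighted-score idea: restrict the final-position logits to the ten digit tokens '0'..'9', softmax them, and take the expected value under the resulting probabilities. The single_id() lookup is an assumption; substitute however your tokenizer maps a single character to a token id.

```python
import torch

# Token ids for the rating characters '0' through '9' (assumed lookup).
digit_ids = [tokenizer.single_id(str(d)) for d in range(10)]

logits = model.forward(prompt_ids, cache)   # shape (batch, seq, vocab)
digit_logits = logits[0, -1, digit_ids]     # logits for '0'..'9' only
weights = torch.softmax(digit_logits.float(), dim = -1)

# Expected value of the rating under the token distribution.
score = (weights * torch.arange(10, dtype = weights.dtype)).sum().item()
# e.g. score ≈ 3.8 when the probability mass clusters around '3' and '4'
```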

I then make all possible Franken merges (e.g. layers = [:30] + [20:], for a 10-layer repeat in the middle) and plot the ratings. The point (0,0) is the baseline (no repeated block); more blue means better results than the baseline, and red means worse.
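
To make the slice notation concrete, this is the layer order that layers = [:30] + [20:] implies for a hypothetical 40-layer model (the layer count is just an example):

```python
# Pass-through merge: run layers 20-29 twice.
n_layers = 40
repeat_start, repeat_end = 20, 30
layer_order = list(range(repeat_end)) + list(range(repeat_start, n_layers))
# -> [0, 1, ..., 29, 20, 21, ..., 39]; the block 20-29 is repeated
```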

Interestingly, there are regions where layer repeats totally destroy the output (premature stop tokens, words looping or repeating forever), and some repeat blocks generate more interesting writing (greater use of descriptive language).

fblissjr commented 7 months ago

I had the exact same use case and was wondering how exllamav2 implemented it. Came across this paper today purely by happenstance, and can't tell if this is doing anything different: https://arxiv.org/abs/2402.05099

dnhkng commented 7 months ago

I have done lots of testing, and found some great results. Will publish and share the code and models soon!

fblissjr commented 7 months ago

> I have done lots of testing, and found some great results. Will publish and share the code and models soon!

Looking forward to it! Thanks!