Closed dnhkng closed 8 months ago
Yes, that's exactly what you'd do. The generator also has this built in. It automatically scans the input IDs and skips inference up to the first position that's different from the cache.
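The prefix-scanning behaviour described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not exllamav2's actual implementation; the function name and token IDs are made up.

```python
# Hypothetical sketch of cached-prefix reuse: compare the new input IDs
# against the IDs already represented in the cache, and only run the
# forward pass from the first position where they diverge.

def first_divergent_position(cached_ids, new_ids):
    """Return the index of the first token where the two sequences differ."""
    limit = min(len(cached_ids), len(new_ids))
    for i in range(limit):
        if cached_ids[i] != new_ids[i]:
            return i
    return limit

cached = [1, 15, 42, 7, 99]
incoming = [1, 15, 42, 8, 100, 3]
start = first_divergent_position(cached, incoming)
# Only incoming[start:] needs a fresh forward pass; the cache already
# covers incoming[:start].
```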
@turboderp Thanks, I figured it out myself; I can't use the generator, since I'm accessing the raw logits for a rating system. I've built a system for rating stories, which delayed my Franken model evaluation a bit, but I first had to overcome this issue: https://twitter.com/aparnadhinak/status/1748368364395721128 Now I use a very explicit prompt with descriptions of the rating system and examples, and I collect the probabilities of the rating tokens as weights. I get nice output like this, where the score is about 3.8 on a 0-9 scale. (I don't use 0-10, as the '1' token could stand for either 1 or 10, and recovering the real probabilities would take more work; assigning letters instead also works and gives the rating system more range.)
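The probability-weighted score can be computed roughly like this. A minimal sketch, assuming you have already pulled out the logits for the ten digit tokens '0'..'9' from the model's next-token distribution; the logit values below are invented for illustration.

```python
# Hypothetical sketch: softmax the logits of the ten rating tokens and take
# the probability-weighted mean, giving a continuous score on the 0-9 scale.
import math

def expected_score(digit_logits):
    """digit_logits[i] is the raw logit of the token for rating i (0..9)."""
    m = max(digit_logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in digit_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(rating * p for rating, p in enumerate(probs))

# Example with made-up logits peaked around ratings 3-4:
score = expected_score([0.1, 0.3, 0.8, 2.0, 1.7, 0.4, 0.0, -0.5, -1.0, -1.5])
```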
I then make all possible Franken merges (e.g. layers = [:30] + [20:], for a 10-layer repeat in the middle) and plot the ratings. The point (0,0) is the baseline (no repeated blocks); more blue means better results than the baseline, and red means worse.
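The slice notation above can be made concrete with a small helper. This is a hypothetical sketch of how such layer lists might be enumerated; `franken_layers` is not part of any library.

```python
# Hypothetical sketch of building a "Franken" layer list: concatenate
# layers[:end] with layers[start:], so the block [start:end] runs twice,
# matching the [:30] + [20:] example (layers 20-29 repeated).

def franken_layers(n_layers, start, end):
    """Layer indices for a model with the block [start:end] repeated once."""
    layers = list(range(n_layers))
    return layers[:end] + layers[start:]

# For a 40-layer model, repeat layers 20-29:
merged = franken_layers(40, 20, 30)
# 50 entries total: 0..29 followed by 20..39.
```

Sweeping `start` and `end` over all valid pairs gives every merge in the grid; `start == end` reproduces the baseline model at (0,0).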
Interestingly, there are regions where layer repeats totally destroy the output (early stop tokens, looping or repeating words forever) and some repeat blocks generate more interesting writing (greater use of descriptive language).
I had the exact same use case and was wondering how exllamav2 implemented it. Came across this paper today purely by happenstance, and can't tell if this is doing anything different: https://arxiv.org/abs/2402.05099
I have done lots of testing, and found some great results. Will publish and share the code and models soon!
Looking forward to it! Thanks!
I have a use case where I need to run inference many times with a long prompt prefix (think multi-shot prompting).
I have two questions:
1) When you process the initial prompt, it is done as:
model.forward(ids[:, :-1], cache, preprocess_only = True)
Why don't we include the last token (:-1)?
2) Can we do a second round of this? I.e., do a forward pass on the prompt prefix using preprocess_only = True, then record the size of the cache. Then process the remaining prompt and do some text generation. When the inference is done, reset the cache back to the size of the prompt prefix, like:
cache.current_seq_len = prefix_cache_size
The goal is to reduce processing time and to keep subsequent inferences from being affected by previous ones.
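The bookkeeping being asked about can be simulated without the real library. A minimal sketch, assuming the exllamav2-style behaviour quoted in this thread (a writable cache.current_seq_len); the FakeCache class here is a stand-in for the real KV cache, and only the save/rewind pattern is the point.

```python
# Hedged sketch of the prefix-rewind pattern: process the shared prefix
# once, remember the cache length, then after each generation rewind
# current_seq_len so the next request starts from the bare prefix again.

class FakeCache:
    """Stand-in for an exllamav2-style KV cache (bookkeeping only)."""
    def __init__(self):
        self.current_seq_len = 0

    def append(self, n_tokens):
        # Models a forward pass adding n_tokens worth of keys/values.
        self.current_seq_len += n_tokens

cache = FakeCache()
cache.append(100)                        # forward pass over the prompt prefix
prefix_cache_size = cache.current_seq_len

for suffix_len in (12, 30, 7):           # three independent requests
    cache.append(suffix_len)             # remaining prompt + generated tokens
    cache.current_seq_len = prefix_cache_size  # rewind: drop suffix KV entries
```

Rewinding only resets the length counter; whether stale entries past that point are safely overwritten on the next forward pass is exactly the behaviour the question asks the maintainer to confirm.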