turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Add "min tokens" slider to webui #172

Closed by EyeDeck 1 year ago

EyeDeck commented 1 year ago

I implemented what I was pondering here: https://github.com/turboderp/exllama/issues/166#issuecomment-1642024852

The idea is to add a minimum response token slider to the webUI. It works by watching the model's output: when the model starts impersonating someone other than the current speaker, generation is rewound, and any tokens found to begin an impersonation are banned when the next token is (re)generated. That covers "User" for "User:", "An" for "Anon:", and even "U" if the model tries to be sneaky and get around the ban by writing the name as two tokens like "U" + "ser", or any variation thereof. This only applies until we've already generated more tokens than "Min tokens"; past that point the switch is handled normally (i.e. the response is allowed to end).
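For anyone curious about the control flow, here's a minimal, self-contained sketch of the rewind-and-ban loop. All the names (`looks_like_impersonation`, `sample_token`, the toy string vocabulary) are made up for illustration; the actual PR hooks into the webUI's generator and sampler rather than working on strings like this:

```python
import random

PARTICIPANTS = ["User", "Anon"]   # speakers the model must not impersonate
MIN_TOKENS = 8                    # the "Min tokens" slider value
VOCAB = ["U", "ser", "An", "on", ":", "Well", ",", " I", " think", " so", ".", "\n"]

def looks_like_impersonation(line_start: str) -> bool:
    # True if the text since the last linebreak could be growing into
    # "User:", "Anon:", etc. A prefix match catches multi-token spellings
    # like "U" + "ser" as well as a single-token "User".
    return any(name.startswith(line_start) or line_start.startswith(name)
               for name in PARTICIPANTS)

def sample_token(banned: set) -> str:
    # Stand-in for the real sampler: pick any token not banned at this position.
    return random.choice([t for t in VOCAB if t not in banned])

tokens, banned = [], set()
while len(tokens) < 24:
    tok = sample_token(banned)
    tokens.append(tok)
    line = "".join(tokens).rsplit("\n", 1)[-1]   # text on the current line
    if line and len(tokens) <= MIN_TOKENS and looks_like_impersonation(line):
        # Rewind to the start of the line, ban the token that began the
        # impersonation, and resample from there.
        while tokens and tokens[-1] != "\n":
            first = tokens.pop()
        banned.add(first)
        continue
    banned.clear()   # bans only apply while retrying the current position

print(repr("".join(tokens)))
```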

I don't know whether forcing longer output in exactly this way is a novel approach, but I've found that it works quite well. The fear was that the model might throw a fit and start hallucinating if I took its favorite token away, but it seems okay. I guess it's because, immediately after a linebreak, the model has a lot of freedom to choose a direction that won't devolve into insanity. Additionally, this code doesn't actually trigger that often, because the model quickly picks up the pattern from context and starts generating longer replies without needing to be explicitly disciplined.

The slider also allows for newlines until "Min tokens" is reached, even when "End on newline" is checked. I'm not sure if this was necessary, but it seemed like it would be less intuitive to have a slider that the model can basically arbitrarily ignore just by spitting out a linebreak (or EOS, since the webUI code already turns those into linebreaks).
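The interaction with "End on newline" amounts to roughly this (again just a sketch with made-up names, not the actual webUI code):

```python
def should_stop_on_newline(token: str, num_generated: int,
                           min_tokens: int, end_on_newline: bool) -> bool:
    # "Min tokens" overrides "End on newline": a linebreak (which is also
    # what the webUI substitutes for EOS) doesn't end the response until
    # the minimum length has been reached.
    if num_generated < min_tokens:
        return False
    return end_on_newline and token == "\n"
```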

Anyway, I tried to polish everything up; the code isn't very long, but feel free to ruthlessly tweak/refactor it if needed, or tell me what I need to fix. I've tested it pretty thoroughly (at one point I must've had 30+ webUI tabs open at once from repeated relaunches, because I couldn't be bothered to look up the flag to stop them) and it looks good on my end.

EyeDeck commented 1 year ago

Also, I just realized that this doesn't actually ignore EOS tokens; it only prevents premature participant switches. I guess I was mostly testing with a lot of context, and at some point the model seems to stop wanting to emit EOS tokens altogether, in favor of just (trying to) switch participants. And while preventing participant switches works well in my testing, ignoring an EOS token completely derails the output nearly every single time.

So, the "Min tokens" label is kind of misleading, since the model can still stop when it really, genuinely has no clue where to go. I don't want to make it ignore EOS tokens, because garbage output is worse than no output. I still can't think of anything succinct to change the label to, though.