turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Addition of DRY: A modern repetition penalty that reliably prevents looping #447

awtrisk opened this issue 1 month ago

awtrisk commented 1 month ago

Would it be worth it to add DRY as an alternative to the traditional repetition penalty? Users have reported that it actually works, and the PR on the ooba repo itself seems to be solid. It also has a llama.cpp PR. There seem to be barely any downsides to it, either.

If it seems good, I can make the PR and implement it here.

turboderp commented 1 month ago

As far as I can tell it's basically just an n-gram penalty, but without combining it with a beam search it doesn't really offer a way to discourage repetitions before they occur. I.e., the model is allowed to start down the path of a repetition, and it's only somewhere along that path that the penalty kicks in, at which point it's impossible to turn back.

So I'm not too sure about it. Are there any thorough comparisons to other methods like increased temperature, skew, frequency penalty etc.?

awtrisk commented 1 month ago

AFAIK this wasn't meant to discourage repetition before it starts; rather, once a pattern of repetition occurs, it can quickly cull it by biasing against the repeated tokens (see the sketch below). Imo this is better than the methods we currently have for preventing repetition.

@p-e-w may be able to shed more insight on things like comparisons, although I will be testing it with other samplers.
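For concreteness, here is a minimal sketch of the mechanism described above, written against plain Python lists rather than exllamav2's tensors. It is not the PR's actual implementation: the parameter names (multiplier, base, allowed_length) are borrowed from the DRY proposal's description but are treated as illustrative here, and details such as sequence breakers are omitted.

```python
# Illustrative sketch of a DRY-style sequence penalty (not exllamav2 code).
# context: previously generated token ids; logits: {token_id: logit}.
# multiplier, base, allowed_length follow the names used in the DRY proposal,
# but the exact semantics here are a simplification.

def dry_penalty(context, logits, multiplier=0.8, base=1.75, allowed_length=2):
    n = len(context)
    best = {}  # token that would extend a repetition -> longest matching suffix length

    # For each earlier position i, measure how long the suffix ending at i-1
    # matches the suffix ending at the current position n-1.
    for i in range(1, n):
        match_len = 0
        while (match_len < i
               and context[i - 1 - match_len] == context[n - 1 - match_len]):
            match_len += 1
        if match_len >= allowed_length:
            tok = context[i]  # the token that continued the earlier occurrence
            best[tok] = max(best.get(tok, 0), match_len)

    # Penalize those continuations, with the penalty growing exponentially in
    # the match length, so a budding loop is culled quickly once it starts.
    for tok, match_len in best.items():
        if tok in logits:
            logits[tok] -= multiplier * base ** (match_len - allowed_length)
    return logits

# Example: "7 1 2" has occurred before and the context ends in "7 1 2" again,
# so token 3 (which followed the earlier occurrence) is penalized.
logits = {1: 0.0, 2: 0.0, 3: 0.0}
dry_penalty([7, 1, 2, 3, 7, 1, 2], logits)  # logits[3] is pushed down
```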

p-e-w commented 1 month ago

DRY is indeed an n-gram/sequence penalty, but it works a little differently from no_repeat_ngram_size and other proposals I've seen. The differences can be summarized as follows:

Simply put, it works. I and others have been running DRY for over two months now, and it's such a massive improvement over traditional repetition penalties that I can't imagine going back. Looping is a scourge, and the existing penalties are a cure that's almost worse than the disease, being noticeably detrimental to output quality. DRY is far better than the three flavors of RepPen at actually preventing repetition, while leaving standard sentence structure completely unaffected.

All samplers are hacks by definition (we should be able to just use the distribution from the model as-is). DRY was developed not primarily from theoretical considerations, but guided by constant real-world experimentation. Having generated and examined probably in excess of 200k tokens in well over 100 contexts by now using DRY, I can confidently say that it works, and enables results that cannot be replicated using any combination of the widely available samplers of today.

yamosin commented 1 month ago

Really looking forward to seeing it implemented on TabbyAPI

AgeOfAlgorithms commented 3 weeks ago

bump

Vhallo commented 2 weeks ago

The performance issues have since been solved thanks to belladoreai, so it might be worthwhile to integrate this now.

AgeOfAlgorithms commented 2 weeks ago

I just wanted to bring this comment by @belladoreai here for everyone's convenience. It gives another good reason why no_repeat_ngram_size is unsuitable for stopping repetition (a short sketch of the exact-n-gram ban it describes follows the quote). This is from their discussion with @p-e-w:

For what it's worth, I've done a lot of experimentation with no_repeat_ngram_size in the past and I can confirm it's fairly useless in a chat context. It might be useful in other contexts, especially in contexts where the input is relatively small. But when a chat message history grows, using no_repeat_ngram_size typically leads to situations where the model is intentionally writing broken English (like writing "engglish" instead of "english"), where the brokenness of the language just grows more and more absurd over time. This seems to happen because in many cases (especially with smaller models) the model perceives repetitive output to be extremely likely - so likely that even broken versions of the repetitive output appear more likely than some other alternative continuation of the text. So when we prevent the model from generating the exact same repetitive continuation to the text, it chooses to use a broken alternative version of the same repetitive text instead of choosing some more natural text.

I do not recommend using no_repeat_ngram_size except at very high values, if no other "circuit breaker" for repetition exists.
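To illustrate the point belladoreai makes above, here is a rough sketch of an exact-n-gram ban in the spirit of no_repeat_ngram_size. The function name and variables are illustrative, not the Transformers or exllamav2 internals: the ban only fires when the last n-1 tokens exactly match an earlier n-1-token window, so a near-duplicate continuation (such as a misspelled variant) slips through untouched.

```python
# Rough sketch of an exact n-gram ban (illustrative only; not the actual
# no_repeat_ngram_size implementation in Transformers or exllamav2).

def banned_next_tokens(context, ngram_size):
    """Tokens that would complete an n-gram already present in `context`."""
    banned = set()
    if len(context) < ngram_size:
        return banned
    prefix = tuple(context[-(ngram_size - 1):])  # last n-1 generated tokens
    for i in range(len(context) - ngram_size + 1):
        if tuple(context[i:i + ngram_size - 1]) == prefix:
            banned.add(context[i + ngram_size - 1])
    return banned

context = [5, 9, 2, 7, 5, 9, 2]        # "... 5 9 2 7 ... 5 9 2"
print(banned_next_tokens(context, 4))  # {7}: only the exact continuation is banned
# A slightly different token (e.g. a misspelled variant) is not in the banned set,
# which is how a model determined to repeat itself ends up producing broken text.
```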