turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Support for no_repeat_ngram_size #198

Closed. anujnayyar1 closed this issue 1 week ago.

anujnayyar1 commented 10 months ago

Hey turboderp,

As we all probably experience, LLMs at high compression rates seem to become more repetitive. One decoding trick that has been helpful when using HF's transformers library is the no_repeat_ngram_size param.

The logits processor for it can be found here: https://github.com/huggingface/transformers/blob/7b6324e18ee1b43d130a381fedddeb2b544e9e1a/src/transformers/generation/logits_process.py#L751

Essentially, no n-gram of the given size is ever allowed to repeat, which cuts off the endless repetitive loops of garbage you can get at longer contexts. This works great in conjunction with repetition penalties, since cranking the repetition penalty up on its own can significantly degrade output quality.
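For reference, here is a minimal standalone sketch of the idea (illustrative only, not transformers' actual implementation; the function names are made up for this example):

```python
# Minimal sketch of n-gram blocking, in the spirit of HF's
# NoRepeatNGramLogitsProcessor. Any token that would complete an
# n-gram already seen in the sequence has its logit set to -inf.

import math

def banned_ngram_tokens(generated: list[int], n: int) -> set[int]:
    """Token ids that would repeat an existing n-gram."""
    if n <= 0 or len(generated) < n:
        return set()
    # Map each (n-1)-token prefix to the tokens that have followed it.
    followers: dict[tuple[int, ...], set[int]] = {}
    for i in range(len(generated) - n + 1):
        prefix = tuple(generated[i : i + n - 1])
        followers.setdefault(prefix, set()).add(generated[i + n - 1])
    # Tokens that have followed the current (n-1)-token suffix are banned.
    current = tuple(generated[-(n - 1):]) if n > 1 else ()
    return followers.get(current, set())

def apply_ngram_ban(logits: list[float], generated: list[int], n: int) -> list[float]:
    """Mask banned tokens so they can never be sampled."""
    banned = banned_ngram_tokens(generated, n)
    return [-math.inf if tok in banned else score
            for tok, score in enumerate(logits)]

# With n=2, "5 7" has already occurred, so 7 is banned after the trailing 5.
print(banned_ngram_tokens([5, 7, 3, 5], n=2))  # {7}
```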

Thanks as always!! Also completely understandable if this is not a top priority or is out of scope for the library!

turboderp commented 1 week ago

You can essentially achieve this with DRY now, so I guess I can close this issue (:
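For anyone finding this later, a rough pure-Python sketch of the DRY idea: instead of a hard n-gram ban, tokens that would extend a repetition of earlier context are penalized, with the penalty growing exponentially in the length of the repeated match. This is a sketch of the concept only, not exllamav2's implementation, and the parameter names here are just illustrative:

```python
# Rough sketch of a DRY-style penalty: if the current suffix of the
# sequence also occurred earlier, the token that followed it back then
# is penalized, and longer matches get exponentially larger penalties.

def dry_penalties(generated: list[int],
                  multiplier: float = 0.8,
                  base: float = 1.75,
                  allowed_length: int = 2) -> dict[int, float]:
    """Map token id -> logit penalty for tokens that would continue
    a repeat of an earlier part of the context."""
    penalties: dict[int, float] = {}
    n = len(generated)
    for i in range(n - 1):
        # Length of the match between the suffix ending at the last
        # token and the span ending at position i.
        match_len = 0
        while (match_len < i + 1
               and generated[i - match_len] == generated[n - 1 - match_len]):
            match_len += 1
        if match_len > allowed_length:
            # A repeat would next produce the token that followed the
            # earlier occurrence; penalize it instead of banning it.
            next_tok = generated[i + 1]
            penalty = multiplier * base ** (match_len - allowed_length)
            penalties[next_tok] = max(penalties.get(next_tok, 0.0), penalty)
    return penalties

# The suffix "3 5 7" occurred earlier followed by 9, so 9 is penalized.
print(dry_penalties([3, 5, 7, 9, 1, 3, 5, 7]))  # {9: 1.4...}
```

The soft penalty is what makes this a good replacement for no_repeat_ngram_size: short incidental repeats stay cheap, while long verbatim loops become rapidly more expensive rather than being cut off by a hard rule.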