Hey turboderp,
As many of us have probably experienced, LLMs seem to become more repetitive at high compression rates. One decoding trick that has been helpful when using HF's transformers library is the no_repeat_ngram_size parameter.
The logits processor for it can be found here: https://github.com/huggingface/transformers/blob/7b6324e18ee1b43d130a381fedddeb2b544e9e1a/src/transformers/generation/logits_process.py#L751
Essentially, any n-gram of the given size that has already appeared in the output is banned from being generated again, which stops the endless repetitive loops of garbage that show up with longer contexts. This works great in conjunction with repetition penalties, since cranking the repetition penalty higher instead can significantly degrade output quality.
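For reference, here is a minimal pure-Python sketch of the idea (function names are my own, not the transformers API): collect every n-gram seen so far keyed by its (n-1)-token prefix, then ban any token that would complete an n-gram whose prefix matches the last (n-1) generated tokens.

```python
# Hedged sketch of no-repeat-n-gram filtering, loosely following the idea
# behind transformers' NoRepeatNGramLogitsProcessor. Names here are
# illustrative, not the library's API.

def banned_next_tokens(tokens, ngram_size):
    """Return the set of next tokens that would complete an n-gram
    already present in `tokens`."""
    if len(tokens) + 1 < ngram_size:
        return set()
    # Index every n-gram seen so far by its (n-1)-token prefix.
    seen = {}
    for i in range(len(tokens) - ngram_size + 1):
        prefix = tuple(tokens[i : i + ngram_size - 1])
        seen.setdefault(prefix, set()).add(tokens[i + ngram_size - 1])
    # Ban tokens that would repeat an n-gram whose prefix matches
    # the last (n-1) generated tokens.
    current_prefix = tuple(tokens[len(tokens) - ngram_size + 1 :])
    return seen.get(current_prefix, set())

def apply_no_repeat_ngram(logits, tokens, ngram_size):
    """Set the logits of banned tokens to -inf so they can't be sampled."""
    for tok in banned_next_tokens(tokens, ngram_size):
        logits[tok] = float("-inf")
    return logits
```

In transformers itself this corresponds to simply passing `no_repeat_ngram_size=3` (for example) to `model.generate(...)`.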
Thanks as always!! Also completely understandable if this is not a top priority or is out of scope for the library!