turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Token healing (under 40 LOC) #260

Closed ahmed-moubtahij closed 6 months ago

ahmed-moubtahij commented 6 months ago

Feature request

Token healing rectifies the token boundary bias in greedy tokenization. It does this by trimming and regrowing the prompt to better align with the model's tokenizer, thus enhancing generation quality. The improvement is clearest with completion models.

Token boundary bias is a silent performance killer that doesn't seem to be widely known. It has a clear impact on completion quality, though I'm not sure where it would fit as an exllamav2 feature.

A more thorough explanation of the problem: The Art of Prompt Design: Prompt Boundaries and Token Healing | by Scott Lundberg.

Motivation

Given a completion prompt with a partial URL ending in `:`, the model may have seen the expected continuation `://` as a single token during training. However, the prompt's tail token `:` signals that the next token is not `//`, so the model generates a wrong completion. Such errors compound in auto-regressive language models.
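The trim-and-regrow idea can be sketched in a few lines. This is a toy illustration, not the linked implementation or exllamav2's: the hand-made vocabulary, greedy longest-match tokenizer, and `heal` helper are all assumptions made for the example. Healing trims the prompt's last token and constrains the next sampled token to those whose text extends the trimmed tail, so `://` can be regrown as one token.

```python
# Toy vocabulary in which "://" exists both split and merged,
# mimicking how BPE merges create multi-character tokens.
VOCAB = ["http", "s", ":", "//", "://", "example", ".", "com", "/"]

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization (a stand-in for real BPE)."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)),
                    key=len, default=text[i])  # unknown char: emit as-is
        tokens.append(match)
        i += len(match)
    return tokens

def heal(prompt, vocab=VOCAB):
    """Trim the prompt's last token; return (trimmed_prompt, allowed_tokens).

    The generator should then sample only from allowed_tokens (tokens whose
    text starts with the trimmed tail), letting it pick "://" directly."""
    tail = tokenize(prompt, vocab)[-1]
    trimmed = prompt[: len(prompt) - len(tail)]
    allowed = [t for t in vocab if t.startswith(tail)]
    return trimmed, allowed

trimmed, allowed = heal("https:")
print(trimmed)  # "https"
print(allowed)  # [":", "://"]
```

In a real sampler the prefix constraint would be applied as a logit mask over the vocabulary for the first regenerated step, after which decoding proceeds normally.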

Your contribution

My implementation (under 40 LOC): https://github.com/Ayenem/TokenHealer/tree/main

ahmed-moubtahij commented 6 months ago

Found out about the existing implementation in exllamav2.