turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Constrained generation. What is needed? #265

Closed · meditans closed this issue 3 months ago

meditans commented 9 months ago

Hi, and thank you for the wonderful work on exllama and exllama2.

I was wondering if you could jot down, in the abstract, what it would take to implement constrained generation based on a grammar (like guidance or outlines), because I couldn't find any servers based on exllamav2 that offer that possibility.

Thanks!

turboderp commented 8 months ago

There's LM Format Enforcer, which supports ExLlamaV2. But basically what you need is a filter that limits sampling to the set of tokens valid under a given grammar and then updates its state after every sample.
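
To make that concrete, here's a rough sketch of the sampling side of such a loop. None of these names come from exllamav2; the filter object and its methods are hypothetical, and only the masking function is meant literally:

```python
import torch

def constrained_sample(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    """Sample one token after masking out everything the filter disallows.
    `logits` is the raw 1D score vector over the full vocabulary."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0                        # keep only legal tokens
    probs = torch.softmax(logits + mask, dim=-1)
    return int(torch.multinomial(probs, 1).item())

# The outer loop then alternates between the model and the filter:
#   grammar_filter.begin(prompt)                   # hypothetical filter object
#   while not grammar_filter.done():
#       logits = model_forward(tokens)             # your model call here
#       tok = constrained_sample(logits, grammar_filter.allowed_tokens())
#       grammar_filter.advance(tok)                # update grammar state
#       tokens.append(tok)
```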

There's an interface for it in exllamav2/generator/filters/base.py, along with an example "select" filter that constrains generation to a fixed set of strings (case-sensitive or not). The tokenizer also provides a trie, so you can efficiently narrow the vocabulary down to a constrained set in various ways.
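
As a standalone illustration of the "select" idea (this is not the base.py interface; the class and method names here are invented, and a real implementation would walk the tokenizer's trie rather than comparing fixed token sequences):

```python
class SelectFilter:
    """Constrain generation to one of a fixed set of strings.
    `encode` is any str -> list[int] tokenizer function."""

    def __init__(self, options: list[str], encode):
        self.sequences = [encode(o) for o in options]
        self.pos = 0

    def allowed_tokens(self) -> set[int]:
        # Token ids that keep at least one option alive at this position
        return {s[self.pos] for s in self.sequences if self.pos < len(s)}

    def advance(self, token: int) -> None:
        # Drop the options the sampled token just ruled out, then step forward
        self.sequences = [s for s in self.sequences
                          if self.pos < len(s) and s[self.pos] == token]
        self.pos += 1

    def done(self) -> bool:
        # Finished once some surviving option has been fully emitted
        return any(len(s) == self.pos for s in self.sequences)
```

Matching whole token sequences like this is a simplification: the same string can tokenize in several ways, which is exactly what the trie-based approach handles.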

I want to add more filters down the line, including regex and grammar filters, but mostly I've gotten stuck on finding an existing grammar library, in order to keep the codebase lean. Obviously there are a number of libraries out there, but few expose a way to evaluate a partial string and, instead of erroring out at the end, return the set of valid continuations.
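
For regex at least, the third-party `regex` package (a superset of the stdlib `re`) does expose this primitive: its `partial=True` flag distinguishes "this string could still become a match" from "dead end", which is the test a filter needs when pruning candidate tokens. The pattern below is just an illustration:

```python
import regex  # third-party `regex` package; stdlib `re` has no partial matching

pattern = regex.compile(r'"name":\s*"[A-Za-z]+"')

def prefix_status(text: str) -> str:
    m = pattern.fullmatch(text, partial=True)
    if m is None:
        return "dead"                      # no continuation can ever match
    return "partial" if m.partial else "complete"

print(prefix_status('"name": "Sa'))        # partial  -> keep generating
print(prefix_status('"name": "Sally"'))    # complete -> constraint satisfied
print(prefix_status('"name": 42'))         # dead     -> prune this branch
```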

There are some fundamental problems with grammars, too, like the fact that not all rules can be evaluated left-to-right, and some more advanced grammars just don't work well in a generative framework, like programming languages with arbitrarily long comment fields and stuff.

I've further found that not all models work equally well under constrained generation. Specifically, Mistral tends to be very certain about a smaller subset of tokens, which works fine under regular sampling but fails miserably when you end up sampling from the messier tail end of the distribution. Stuff like:

The girl's name was {Sally|Bob|refrigerator}

Here Llama2-7B will likely pick Sally, because even if it isn't the model's top pick, it will still score higher than the other options given the context. Meanwhile, Mistral-7B will have a number of girls' names in its top 50 or so choices after The girl's name was, but if none of those is Sally, the model is just as likely to pick Bob or refrigerator.
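
A tiny numeric sketch of that failure mode, with invented scores for the three allowed continuations (Sally, Bob, refrigerator):

```python
import torch

# Invented logit scores; only their relative values matter here
llama_like   = torch.tensor([4.0, 1.0, -2.0])  # Sally clearly ahead of the pack
mistral_like = torch.tensor([0.1, 0.0, -0.1])  # tail of the distribution, nearly flat

for name, scores in [("llama-like", llama_like), ("mistral-like", mistral_like)]:
    # Constrained sampling renormalizes over *only* the allowed tokens,
    # so these relative scores fully determine the pick:
    print(name, [round(p, 3) for p in torch.softmax(scores, dim=-1).tolist()])

# llama-like   [0.95, 0.047, 0.002]   -> Sally almost always wins
# mistral-like [0.367, 0.332, 0.301]  -> "refrigerator" wins ~30% of the time
```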

But yeah, very interesting stuff and I wish I had infinite time to work on it and not a million other things distracting me. But I hope that's enough to go on if you want to experiment some.