outlines-dev / outlines

Structured Text Generation
https://outlines-dev.github.io/outlines/
Apache License 2.0

Allow Tokens to Span Multiple Terminals in CFG #684

Open brandonwillard opened 5 months ago

brandonwillard commented 5 months ago

Discussed in https://github.com/outlines-dev/outlines/discussions/683

Originally posted by **lapp0** January 23, 2024

### What behavior of the library made you think about the improvement?

Currently, each generated token must be part of a terminal or a complete terminal. A token cannot start in one terminal and end in another.

E.g. in the `gpt2` tokenizer, `{"` is a valid token. However, if `{` and `"` are separate terminals, as in a typical JSON grammar, `{` is allowed in the initial state (`CFGFSM.allowed_token_ids(0)`) but `{"` is not.

This approach not only deviates from a correct representation of the grammar, but also hurts generation quality. For example, in the arithmetic grammar from `README.md`, using `mistralai/Mistral-7B-v0.2`, the most probable second token is ` +` (space-prefixed). Because the space is a separate terminal, this token isn't legal, so `+` is selected instead. In scenarios like this, spaces are seldom produced even though they are grammatically valid and preferred by the model, because the model would have to select the space as a standalone token.

### How would you like it to behave?

Permit the generation of any token that complies with the grammar's production rules and is valid in the context of the preceding sequence of tokens, regardless of whether it spans multiple terminals.

This will require careful engineering and benchmarking to ensure the new trie-of-`RegexFSM` described at the end of section 4.2 of the outlines paper works properly.

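
To make the difference concrete, here is a minimal, self-contained sketch of the two acceptance rules. It is not the outlines implementation: the `TERMINALS` table, both function names, and the idea of passing in candidate terminal sequences are hypothetical. It also assumes literal-string terminals and treats "part of a terminal" as "prefix of a terminal", which is a simplification.

```python
from typing import Dict, Sequence

# Hypothetical toy terminals: literal strings only, to keep the sketch small.
TERMINALS: Dict[str, str] = {
    "LBRACE": "{",
    "QUOTE": '"',
    "WS": " ",
    "PLUS": "+",
}


def fits_single_terminal(token: str, allowed: Sequence[str]) -> bool:
    """Restrictive rule: the token must be a prefix of (or equal to)
    exactly one of the terminals allowed next."""
    return any(TERMINALS[name].startswith(token) for name in allowed)


def fits_terminal_sequence(token: str, sequences: Sequence[Sequence[str]]) -> bool:
    """Permissive rule: the token may cover several consecutive terminals
    and may end partway through the last one it touches.

    `sequences` stands in for the terminal sequences the parser would
    accept next (an assumption of this sketch).
    """
    if not token:
        return False
    for seq in sequences:
        rest = token
        for name in seq:
            literal = TERMINALS[name]
            if rest.startswith(literal):
                # The token covers this whole terminal; consume it.
                rest = rest[len(literal):]
                if not rest:
                    return True
            elif literal.startswith(rest):
                # The token ends in the middle of this terminal.
                return True
            else:
                break
    return False


json_start = [["LBRACE", "QUOTE"]]
after_number = [["WS", "PLUS"], ["PLUS"]]

print(fits_single_terminal("{", ["LBRACE"]))        # single-terminal token
print(fits_single_terminal('{"', ["LBRACE"]))       # spans two terminals
print(fits_terminal_sequence('{"', json_start))     # spans two terminals
print(fits_terminal_sequence(" +", after_number))   # space-prefixed plus
```

Under the restrictive rule `{"` is rejected even though `{` followed by `"` is grammatical; under the permissive rule both `{"` and ` +` are accepted. Real terminals are regular expressions, so an actual implementation would need partial matching over FSM states rather than string prefixes.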
brandonwillard commented 5 months ago

Converting this back into an issue because it does mostly describe a bug-like situation with the current implementation. Design and approach proposals should take place in the discussion, though.

lapp0 commented 1 month ago

As discussed in https://github.com/outlines-dev/outlines/issues/796#issuecomment-2123709556, resolving this issue will involve ensuring the parsing issues below are resolved.