microsoft / llguidance

Low-level Guidance Parser
MIT License

only allowing valid tokenizations in grammars #1

Open mmoskal opened 4 months ago

mmoskal commented 4 months ago

See https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html and https://github.com/vivien000/regex-constrained-decoding/blob/main/technical_appendix.pdf

Thoughts (unorganized):

The tokens we most need to discard will be along the forced path; for example, after `"` the `,` is forced. Note that if the grammar allows white space between `"` and `,`, there is no forced path, and moreover the token `"` should still be allowed (unless there are tokens `" `, `"\n`, `"\t`, etc. covering all of white space, but I think this is very unlikely).

In the toktrie walk, if we encounter a forced byte, we go into forced mode, where we just chase all forced bytes. We put the first token we find on this path on a list. We do not add any of these tokens to the allow set.

Then, after the token trie walk, for every token on this list we re-create the forced byte string, tokenize it, chop off excess tokens, and add the first token from the tokenization to the allow set and the remaining tokens (if any) as a conditional splice.
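
A rough Python sketch of one possible reading of the two passes above; this is not llguidance code. Everything named here is hypothetical: the grammar is modeled as a callback `next_bytes(prefix)` returning the allowed next bytes, a flat loop over the vocabulary stands in for the trie walk, `greedy_tokenize` stands in for the model's real tokenizer, and "chop off excess tokens" is modeled as a simple `max_tokens` cap. It also assumes a token that ends exactly where a forced byte follows is withheld, which is how I read the `"` then `,` example.

```python
from typing import Callable, Dict, List, Set, Tuple

Vocab = Dict[int, bytes]
NextBytes = Callable[[bytes], Set[int]]  # bytes consumed so far -> allowed next bytes


def greedy_tokenize(vocab: Vocab, data: bytes) -> List[int]:
    """Longest-match tokenizer; only a stand-in for the model's real tokenizer."""
    out, pos = [], 0
    while pos < len(data):
        matches = [(len(t), i) for i, t in vocab.items() if t and data.startswith(t, pos)]
        if not matches:
            raise ValueError(f"cannot tokenize {data[pos:]!r}")
        length, tok_id = max(matches)  # longest matching token wins
        out.append(tok_id)
        pos += length
    return out


def forced_string(prefix: bytes, next_bytes: NextBytes, max_len: int = 64) -> bytes:
    """Extend prefix for as long as the grammar allows exactly one next byte."""
    out = bytearray(prefix)
    while len(out) < max_len:
        allowed = next_bytes(bytes(out))
        if len(allowed) != 1:
            break
        out.append(next(iter(allowed)))
    return bytes(out)


def compute_allowed(
    vocab: Vocab, next_bytes: NextBytes, max_tokens: int = 4
) -> Tuple[Set[int], Dict[int, Tuple[int, ...]]]:
    allow: Set[int] = set()
    splices: Dict[int, Tuple[int, ...]] = {}  # first token -> tokens to splice after it
    entries: Set[bytes] = set()  # byte prefixes at which forced mode starts

    # Pass 1: flat loop standing in for the token-trie walk.
    for tok_id, tok in vocab.items():
        for i in range(len(tok) + 1):
            allowed = next_bytes(tok[:i])
            if len(allowed) == 1:
                entries.add(tok[:i])  # forced byte found: withhold this token for now
                break
            if i < len(tok) and tok[i] not in allowed:
                break  # token rejected by the grammar
        else:
            allow.add(tok_id)  # no forced byte inside or right after this token

    # Pass 2: re-create each forced byte string, tokenize it, chop, and record.
    for prefix in entries:
        ids = greedy_tokenize(vocab, forced_string(prefix, next_bytes))[:max_tokens]
        if not ids:
            continue
        first, rest = ids[0], tuple(ids[1:])
        allow.add(first)
        # Only splice when the first token reaches the forced region; a shorter
        # first token still has other valid continuations.
        if rest and len(vocab[first]) >= len(prefix):
            splices[first] = rest
    return allow, splices


def toy_grammar(prefix: bytes) -> Set[int]:
    """Toy grammar x*", : any number of x, then a quote, then a forced comma."""
    if prefix.endswith(b'",'):
        return set()  # nothing more to match
    if prefix.endswith(b'"'):
        return {ord(",")}  # the comma is forced after the quote
    return {ord("x"), ord('"')}


VOCAB: Vocab = {0: b"x", 1: b'"', 2: b",", 3: b'",', 4: b'x"', 5: b"xx"}

allow, splices = compute_allowed(VOCAB, toy_grammar)
print(sorted(allow))  # [0, 3, 4, 5]
print(splices)        # {4: (2,)}
```

In this toy run the lone `"` token (id 1) is dropped in favor of `",` (id 3), and picking `x"` (id 4) carries `,` (id 2) as its splice. If the grammar allowed white space after `"`, no state would have exactly one allowed byte, so `"` would stay in the allow set, matching the note above.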

Transferred from https://github.com/hudson-ai/guidance/issues/13