See https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html and https://github.com/vivien000/regex-constrained-decoding/blob/main/technical_appendix.pdf
Thoughts (unorganized):
There is an automaton that recognizes all valid sequences of tokens, and moreover this automaton only depends on the previous token; that is, there is a function from tokens that tells you the set of valid next tokens.

By itself it's probably not that useful, since the automaton doesn't exclude enough states. For example, if the regex is for a JSON string and the previous token is `foo`, then the set of allowed next tokens includes `"` and `",`; however, after sampling `"` it will not allow `,` anymore (so in other words we would have to look ahead in this automaton).
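A minimal sketch of that next-token function, assuming a precomputed table; `TokenId`, `NextTokenTable`, and `can_follow` are made-up names for illustration, not derivre or llguidance APIs:

```rust
// Sketch only: precompute, for every token, which tokens may follow it in
// some valid sequence. `can_follow` stands in for whatever byte-level
// automaton check would back this; it is not a real API.
type TokenId = u32;

struct NextTokenTable {
    // allowed[prev][next] == true iff token `next` may follow token `prev`
    allowed: Vec<Vec<bool>>,
}

impl NextTokenTable {
    fn build(vocab_size: usize, can_follow: impl Fn(TokenId, TokenId) -> bool) -> Self {
        let allowed = (0..vocab_size)
            .map(|prev| {
                (0..vocab_size)
                    .map(|next| can_follow(prev as TokenId, next as TokenId))
                    .collect()
            })
            .collect();
        NextTokenTable { allowed }
    }

    // The "function from tokens to the set of valid next tokens".
    fn valid_next(&self, prev: TokenId) -> &[bool] {
        &self.allowed[prev as usize]
    }
}
```

As the JSON-string example shows, such a table over-approximates: it cannot express that sampling `"` rather than `",` changes what is allowed next, which is why lookahead in the automaton would be needed.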
[ ] implement `.forced_byte()` method in `derivre`
[ ] use this for cheap `.forced_byte()` impl in `llguidance`
[ ] while walking token trie, remember all forced paths (there shouldn't be too many of them)
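A rough sketch of what a `forced_byte()` helper could compute; the `Dfa` trait below is a stand-in for a derivative-based engine like derivre (these method names are assumptions, not its real API), and the brute-force scan over all 256 bytes is only for illustration, since the point of pushing this into derivre is presumably to make it cheap:

```rust
// Sketch only: `Dfa` stands in for a derivative-based regex engine; the
// method names are assumptions, not the real derivre API.
trait Dfa {
    type State: Copy;
    fn transition(&mut self, state: Self::State, byte: u8) -> Self::State;
    fn is_dead(&self, state: Self::State) -> bool;
}

/// Returns Some(b) if `b` is the only byte that keeps the match alive from
/// `state`, i.e. every other byte leads to the dead state.
fn forced_byte<D: Dfa>(dfa: &mut D, state: D::State) -> Option<u8> {
    let mut forced = None;
    for b in 0u8..=255 {
        let next = dfa.transition(state, b);
        if !dfa.is_dead(next) {
            if forced.is_some() {
                return None; // at least two viable bytes => nothing is forced
            }
            forced = Some(b);
        }
    }
    forced
}
```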
The tokens we most need to discard will be along the forced path; for example, after `"` the `,` is forced. Note that if the grammar allows whitespace between `"` and `,`, there is no forced path, and moreover the token `"` should still be allowed (unless there are tokens `" `, `"\n`, `"\t`, etc. covering all of the whitespace; but I think this is very unlikely).

In the toktrie walk, if we encounter a forced byte, we go into forced mode, where we just chase all forced bytes. The first token we find on this path we put on some list. We do not add any of these tokens to the allow set.
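A toy sketch of that forced-mode walk; the trie below is a simplification of the real toktrie, and `next_forced` stands in for the regex engine reporting the next forced byte after one has been consumed:

```rust
use std::collections::HashMap;

// Toy trie node; the real toktrie is a packed structure, this is only to
// illustrate the control flow of forced mode.
struct TrieNode {
    token_id: Option<u32>,           // Some(id) if a token ends at this node
    children: HashMap<u8, TrieNode>, // next byte -> child node
}

/// Chase forced bytes from `node` and return the first token found on the
/// path. The caller puts that token on a side list instead of adding it
/// (or anything else on the path) to the allow set.
fn first_token_on_forced_path(
    mut node: &TrieNode,
    mut forced: Option<u8>,
    next_forced: impl Fn(u8) -> Option<u8>, // forced byte after consuming `b`, if any
) -> Option<u32> {
    while let Some(b) = forced {
        node = node.children.get(&b)?;
        if let Some(tok) = node.token_id {
            return Some(tok);
        }
        forced = next_forced(b);
    }
    None
}
```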
Then, after the token trie walk, for every token on this list we re-create the forced byte string, tokenize it, chop off excessive tokens, and add the first token from the tokenization to the allow set, with the remaining tokens (if any) as a conditional splice.
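Read loosely, that fixup step could look like the sketch below; every helper name (`forced_bytes_for`, `tokenize`, `token_len`, `allow_token`, `record_splice`) is a placeholder rather than a real llguidance API, and "conditional splice" is interpreted here as "if the first token is sampled, splice in the rest":

```rust
/// Sketch of the post-walk step: for each candidate token found during
/// forced mode, re-create the forced byte string, tokenize it canonically,
/// drop trailing tokens not fully covered by the forced bytes, then allow
/// the first token and record the rest as a conditional splice.
fn apply_forced_splices(
    candidates: &[u32],                        // first tokens found on forced paths
    forced_bytes_for: impl Fn(u32) -> Vec<u8>, // re-created forced byte string
    tokenize: impl Fn(&[u8]) -> Vec<u32>,      // canonical tokenization
    token_len: impl Fn(u32) -> usize,          // byte length of a token
    mut allow_token: impl FnMut(u32),
    mut record_splice: impl FnMut(u32, &[u32]),
) {
    for &cand in candidates {
        let forced = forced_bytes_for(cand);
        let toks = tokenize(&forced);

        // "Chop excessive tokens": keep only the prefix of the tokenization
        // whose bytes fit entirely inside the forced byte string.
        let mut kept = Vec::new();
        let mut used = 0;
        for &t in &toks {
            if used + token_len(t) > forced.len() {
                break;
            }
            used += token_len(t);
            kept.push(t);
        }

        if let Some((&first, rest)) = kept.split_first() {
            allow_token(first); // the first token goes into the allow set
            if !rest.is_empty() {
                // the remaining tokens become the conditional splice,
                // applied when `first` is actually sampled
                record_splice(first, rest);
            }
        }
    }
}
```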
Transferred from https://github.com/hudson-ai/guidance/issues/13