noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License
1.45k stars 66 forks source link

[Bug? Request?] Force `StringParser` to not split special tokens #126

Open aw632 opened 2 months ago

aw632 commented 2 months ago

Right now, StringParser's implementation is at the character level, so if you give it a special token as the target string, it can possibly generate the same string but with non-special tokens. If a flag could be added that prevents the target string from being split, it would be very helpful. I can help write the PR, but I am not sure where exactly to get started..I see the comment:

It is a debugging / learning tool to show how CharacterLevelParser works together with TokenizerPrefixTree to filter the allowed tokens (some of whom may contain multiple characters)"""

so I think it should be possible?

noamgat commented 1 month ago

The idea of LMFE is to support any sequence of tokens, whose string decoding is legal output. What you are requesting is essentially a violation of this. I'm not sure there's an elegant way to do this.