[Bug? Request?] Force `StringParser` to not split special tokens

noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model

MIT License

1.45k stars, 66 forks

Right now, `StringParser` is implemented at the character level, so if you give it a special token as the target string, the model can produce the same string out of ordinary (non-special) tokens instead of the single special token. A flag that prevents the target string from being split would be very helpful. I can help write the PR, but I am not sure where exactly to get started. I see the comment:

"It is a debugging / learning tool to show how CharacterLevelParser works together with TokenizerPrefixTree to filter the allowed tokens (some of whom may contain multiple characters)"

so I think it should be possible?
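
To make the problem concrete, here is a minimal, self-contained sketch. It does not use the lm-format-enforcer API: the vocabulary, the token ids, and both helper functions (`allowed_char_level`, `allowed_atomic`) are hypothetical and only illustrate the idea. Character-level filtering accepts any token whose text is a prefix of the remaining target, so the special token string can also be spelled out of ordinary pieces. The "atomic" mode is one possible reading of the requested flag, not an existing option.

```python
# Toy vocabulary (hypothetical): token id -> decoded text.
VOCAB = {
    0: "<",
    1: "|",
    2: "eot",
    3: "|>",
    4: "<|eot|>",  # the single special token
    5: "hello",
}
SPECIAL_IDS = {4}  # assumption: ids the tokenizer treats as special


def allowed_char_level(remaining):
    """Ids of tokens whose text is a prefix of the text still to be generated."""
    return sorted(tid for tid, text in VOCAB.items() if remaining.startswith(text))


def tokenizations(remaining):
    """Every token sequence that spells `remaining` under character-level filtering."""
    if not remaining:
        return [[]]
    results = []
    for tid in allowed_char_level(remaining):
        for rest in tokenizations(remaining[len(VOCAB[tid]):]):
            results.append([tid] + rest)
    return results


def allowed_atomic(remaining, target):
    """Hypothetical 'do not split' mode: only a special token equal to the whole target."""
    if remaining != target:
        return []
    return sorted(tid for tid in SPECIAL_IDS if VOCAB[tid] == target)


target = "<|eot|>"
print(tokenizations(target))           # [[0, 1, 2, 3], [4]]
print(allowed_atomic(target, target))  # [4]
```

The first result shows the issue: the sequence `[0, 1, 2, 3]` decodes to the same text as the special token `[4]`, but the model would not treat it as that token. Restricting the first step to special ids whose text equals the full target removes the split path.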

[Bug? Request?] Force `StringParser` to not split special tokens #126