noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Models using GPT2 tokenizer can output invalid JSON #89

Open JoshC8C7 opened 5 months ago

JoshC8C7 commented 5 months ago

For tokenizers that use GPT2's decoder (and potentially any metaspace decoder, TBC), the calculation that determines which new characters to apply can be incorrect because of the length-difference approach used.

As an MWE, using opt-125m (which uses GPT2's tokenizer): when state.current_word_tokens ends with, e.g., [1437] and new_decoded ends with [1437, 6], the two sequences are decoded to strings of the same length (" " and "," respectively), presumably because the decoder's clean-up step strips the space before the punctuation. This means new_characters is an empty string, so the token with ID 6 (a comma) is never applied; in turn, the object parser's state isn't advanced properly and another comma is produced. This double comma makes the JSON invalid, as shown in the example below.

This example should reproduce it, but you may need to set MAX_CONSECUTIVE_WHITESPACES = 6.
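The following is a minimal sketch of the failing length comparison, reconstructed from the token IDs above. It assumes the standard Hugging Face transformers decode API; the slicing at the end is a simplified stand-in for the length-difference calculation described above, not lm-format-enforcer's exact code path.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

prev_tokens = [1437]     # reported to decode to " "
new_tokens = [1437, 6]   # reported to decode to "," after space clean-up

prev_decoded = tokenizer.decode(prev_tokens)
new_decoded = tokenizer.decode(new_tokens)
print(repr(prev_decoded), repr(new_decoded))

# The new characters are derived from the length difference between the
# two decoded strings; if both have the same length, the diff is empty
# and the comma is never fed to the parser.
new_characters = new_decoded[len(prev_decoded):]
print(repr(new_characters))  # expected: '' -- the comma "disappears"
```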

I'm aware that an alternative way to calculate the new characters would be to decode just the new token, but I also know this poses issues with the LlamaTokenizer (among others), which deletes leading spaces (see the sketch below).
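For context, here is a hedged illustration of that leading-space problem with a SentencePiece-based Llama tokenizer; the checkpoint name is an assumption chosen only for illustration, and the exact behaviour can vary across tokenizer versions.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint, used only to illustrate SentencePiece behaviour.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

ids = tokenizer.encode("hello world", add_special_tokens=False)
full = tokenizer.decode(ids)

# Decoding each token on its own drops the leading "▁" space marker, so
# per-token decoding loses characters rather than recovering them exactly.
per_token = "".join(tokenizer.decode([t]) for t in ids)
print(repr(full))       # 'hello world'
print(repr(per_token))  # 'helloworld' -- the space between words is gone
```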

noamgat commented 5 months ago

If I understand correctly, this means that in the case of the GPT2 tokenizer, adding a token to a token sequence can actually remove a character from the decoded output (relative to the sequence without the new token) and replace it with another character?