noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Models using GPT2 tokenizer can output invalid JSON #89

Open JoshC8C7 opened 5 months ago

JoshC8C7 commented 5 months ago

For tokenizers that use GPT2's decoder (and potentially any metaspace decoder, TBC), the calculation that determines which new characters to apply can be incorrect because of the length-difference approach used.

As an MWE, using opt-125m (which uses GPT2's tokenizer): when state.current_word_tokens ends with, e.g., [1437] and new_decoded ends with [1437, 6], the two sequences are decoded to strings of the same length (" " and "," respectively), presumably because the decoder's clean-up step strips the space before the punctuation. This means new_characters is an empty string, so the token with ID 6 (a comma) is never applied; in turn, the object parser's state isn't advanced properly and another comma is produced. This double comma makes the JSON invalid, as shown in the example below.

This example should reproduce it, but you may need to set MAX_CONSECUTIVE_WHITESPACES = 6.
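The following is a minimal sketch of the failing length comparison, reconstructed from the token IDs above. It assumes the standard Hugging Face transformers decode API; the slicing at the end is a simplified stand-in for the length-difference calculation described above, not lm-format-enforcer's exact code path.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

prev_tokens = [1437]     # reported to decode to " "
new_tokens = [1437, 6]   # reported to decode to "," after space clean-up

prev_decoded = tokenizer.decode(prev_tokens)
new_decoded = tokenizer.decode(new_tokens)
print(repr(prev_decoded), repr(new_decoded))

# The new characters are derived from the length difference between the
# two decoded strings; if both have the same length, the diff is empty
# and the comma is never fed to the parser.
new_characters = new_decoded[len(prev_decoded):]
print(repr(new_characters))  # expected: '' -- the comma "disappears"
```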

I'm aware that an alternative way to calculate the new characters would be to decode just the new token, but I also know this poses issues with the LlamaTokenizer (among others), which deletes leading spaces (see the sketch below).
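For context, here is a hedged illustration of that leading-space problem with a SentencePiece-based Llama tokenizer; the checkpoint name is an assumption chosen only for illustration, and the exact behaviour can vary across tokenizer versions.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint, used only to illustrate SentencePiece behaviour.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

ids = tokenizer.encode("hello world", add_special_tokens=False)
full = tokenizer.decode(ids)

# Decoding each token on its own drops the leading "▁" space marker, so
# per-token decoding loses characters rather than recovering them exactly.
per_token = "".join(tokenizer.decode([t]) for t in ids)
print(repr(full))       # 'hello world'
print(repr(per_token))  # 'helloworld' -- the space between words is gone
```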

noamgat commented 5 months ago

If I understand correctly, this means that in the case of the GPT2 tokenizer, adding a token to a token sequence can actually remove a character from the decoded output (relative to the sequence without the new token) and replace it with another character?