JoshC8C7 opened 5 months ago
If I understand correctly, this means that in the case of the GPT2 tokenizer, appending a token to a token sequence can actually remove a character that the previous decode produced and replace it with another character?
For tokenizers that use GPT2's decoder (and potentially any metaspace decoder, tbc), the calculation that determines which characters are newly added can be incorrect, because it relies on the difference in decoded-string lengths.
As a MWE, using opt-125m (which uses GPT2's tokenizer): when `state.current_word_tokens` ends with (e.g.) `[1437]` and `new_decoded` ends with `[1437, 6]`, the two are decoded to strings of the same length by the decoder (`" "` and `","` respectively). This means that `new_characters` is an empty string, and the token corresponding to `6` (a comma) isn't applied. In turn, the object parser's state isn't advanced properly and another comma is produced. This double comma makes the JSON invalid, as you can see in the example below.

This example should reproduce it, but you may need to set
MAX_CONSECUTIVE_WHITESPACES = 6
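A minimal sketch of the failure mode, using a hypothetical stand-in for the tokenizer's decode step (the real behaviour comes from GPT-2's byte-level decoder; the token IDs and strings below just mirror the example above):

```python
# Hypothetical stand-in for the decode step. With GPT-2's byte-level
# decoder, both [1437] and [1437, 6] can come out as one-character strings:
# 1437 alone decodes to " ", but followed by the comma token the space is
# absorbed and the pair decodes to just ",".
def decode(tokens):
    if tokens == [1437]:
        return " "
    if tokens == [1437, 6]:
        return ","
    raise NotImplementedError(tokens)

prev_decoded = decode([1437])      # " "
new_decoded = decode([1437, 6])    # ","

# Length-difference diffing: take everything past the old string's length.
new_characters = new_decoded[len(prev_decoded):]
print(repr(new_characters))  # '' -- the comma is lost, so the parser
                             # state is never advanced past it
```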
I'm aware that an alternative would be to decode just the new token, but I also know that this poses issues with the LlamaTokenizer (among others), which deletes leading spaces.
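For comparison, a sketch of why decoding only the new token isn't safe either. The `llama_decode` function and the token ID below are hypothetical, imitating the sentencepiece-style leading-space stripping described above:

```python
# Hypothetical stand-in imitating LlamaTokenizer-style decoding. Suppose
# token 500 is the piece "▁world", i.e. "world" with a leading space.
def llama_decode(tokens):
    pieces = {500: " world"}
    text = "".join(pieces[t] for t in tokens)
    # Sentencepiece-style decoders drop the leading space of the sequence,
    # so decoding a lone token loses that space.
    return text[1:] if text.startswith(" ") else text

print(repr(llama_decode([500])))  # 'world' -- the leading space is gone
```

So neither length-difference diffing nor per-token decoding is robust on its own across both decoder families.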