Due to Llama3 using BPE pretokenization, some characters (usually Unicode characters like Ł) have their own entries in tokenizer.json (e.g. Ł is 253) but are mapped to a series of byte-level tokens instead: `Ł` is encoded as `[129, 223]`, AKA `['Å', 'ģ']` (with Llama-3.2-3B-instruct).
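For concreteness, here is a minimal sketch of that asymmetry using the `transformers` tokenizer (the token ids are the ones reported above; exact values may differ between checkpoints):

```python
from transformers import AutoTokenizer

# Assumes access to the (gated) meta-llama/Llama-3.2-3B-Instruct checkpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# "Ł" is encoded as two byte-level tokens, not as its dedicated vocab entry 253.
print(tokenizer.encode("Ł", add_special_tokens=False))  # [129, 223]
print(tokenizer.convert_ids_to_tokens([129, 223]))      # ['Å', 'ģ']

# Decoding the dedicated entry reportedly yields a replacement character...
print(tokenizer.decode([253]))        # '�'
# ...while the byte-level pair round-trips correctly.
print(tokenizer.decode([129, 223]))   # 'Ł'
```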
It seems that when generating the tokenizer tree, `_build_regular_tokens_list` makes use of `decode` on individual tokens; however, calling `decode` on `[253]` (or indeed on other tokens that start with Ł) yields a replacement character (�), presumably because pretokenization represents Ł as `[129, 223]` instead. This conversion is done in the `convert_tokens_to_string` part of `decode` - if we decode into tokens rather than into a string, the Ł survives. Since `decode` goes to a string, no Ł is ever seen by the tree builder, so it is not listed as a valid next character / child of the root node of the tree, leading to the error described below.

Whilst you could switch to building the tree by decoding to tokens rather than strings, forcing the model to generate `Ł` via `[253]` instead of `[129, 223]` could degrade performance: 253 never seems to be generated by the model (anecdotally), so forcing it into the output could throw the model off. Instead, we want tokens to be added to the tree in the way they would actually be generated. In most cases we can use `(id, decode(id))` to get this (i.e. the current behaviour), but for Ł this gives `(253, �)`, which is discarded. We actually want to add leaves as the model would generate them: when `convert_ids_to_tokens([253])` yields `['Ł']` and `Ł` is in the pretokenization dictionary, rather than adding `(253, convert_tokens_to_string(['Ł']))` we should add a unary node `(129, "")` whose child is `(223, "Ł")`, and do the same every time we encounter Ł or a similarly pretokenized character.
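A toy sketch of that unary-node insertion (the `TreeNode` class and `add_pretokenized_char` helper here are hypothetical illustrations, not LMFE's actual internals):

```python
# Hypothetical sketch of the proposed fix, not LMFE's real tree code: when a
# vocab entry like (253, "Ł") decodes to "�", insert the byte-token chain the
# model actually generates instead of the unusable single-token leaf.

class TreeNode:
    def __init__(self, text=""):
        self.text = text    # character(s) completed by reaching this node
        self.children = {}  # token id -> TreeNode

def add_pretokenized_char(root, byte_token_ids, char):
    """Insert e.g. byte_token_ids=[129, 223], char="Ł" as (129, "") -> (223, "Ł")."""
    node = root
    for token_id in byte_token_ids[:-1]:
        # Intermediate byte tokens complete no character yet, hence text="".
        node = node.children.setdefault(token_id, TreeNode(text=""))
    node.children[byte_token_ids[-1]] = TreeNode(text=char)

root = TreeNode()
add_pretokenized_char(root, [129, 223], "Ł")
assert root.children[129].text == ""
assert root.children[129].children[223].text == "Ł"
```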
The more pressing implication of this is the first part: not having "Ł" among the root's children means it is also absent from the `tokenizer_alphabet`, which in turn means a generated "Ł" inside a string will be rejected by the `StringParsingState`, causing generation to terminate early and return invalid JSON. Due to pretokenization the model can generate Ł anyway, and `_apply_new_tokens` handles the empty string produced when decoding 129 just fine, but then, when 223 is added and the result is compared to the last decode, a new character appears. An output in which two tokens combine into a single character mismatches the way the tree is generated - in the tree there is no way for two tokens to combine to form a character that is in the alphabet. The `tokenizer_alphabet` should therefore contain all pretokenized inputs (characters like "Ł" that need more than one token to represent them).
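The two-token behaviour is easy to see by decoding the cumulative sequence, which is roughly what the incremental-decode comparison in `_apply_new_tokens` observes (same assumptions as the snippet above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# The first byte token alone is an incomplete UTF-8 sequence...
print(repr(tokenizer.decode([129])))       # '' (or '�', depending on error handling)
# ...and the full character only appears once the second byte token arrives.
print(repr(tokenizer.decode([129, 223])))  # 'Ł'
```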
There is an inherent difficulty with supporting out-of-tokenizer characters in LMFE: if the LMFE approach consists of treating each token as a node on a prefix tree of the tokenizer's characters, these out-of-tokenizer characters require a tree in which a character is itself a node and the tokens are the path to that character. I did not get the chance to approach this yet, but if anyone wants to have a go at it, it would be a great addition to LM Format Enforcer!

Reproduction here.