As I come across more models which have this problem, I'll update the following list:
GPT2Tokenizer, like EleutherAI/gpt-neo-125m

Most if not all RoBERTa-based models and GPT-2-based tokenizer models should be considered BPE-based tokenizers that utilize the ByteLevel() pretokenizer.
This is different from (unicode) char-level BPE tokenizers that utilize byte_fallback within the BPE algorithm to handle <unk> cases.
Theoretically and practically, the ByteLevel pretokenizer prevents the <unk> cases from which the BPE algorithm would otherwise have to fall back.
BPE byte_fallback was implemented to copy the behavior of sentencepiece, where there is no byte-level pretokenization stage and the tokenizer "falls back" to byte tokens when it encounters an <unk> case. LLaMA used sentencepiece with BPE, which required Hugging Face to implement byte_fallback in BPE in a way that was not necessary for unigram. But since it has now been implemented, it is also being exposed for unigram as well.
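To make the contrast concrete, here is a minimal sketch of the two setups, assuming the current tokenizers Python bindings (the byte_fallback keyword only exists in recent versions; these are illustrative configurations, not taken from any specific model):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# GPT-2 style: the ByteLevel pretokenizer maps every input byte to a printable
# unicode character *before* BPE runs, so the model can never hit <unk>.
gpt2_style = Tokenizer(models.BPE())
gpt2_style.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
gpt2_style.decoder = decoders.ByteLevel()

# sentencepiece / LLaMA style: no byte-level pretokenization; instead the BPE
# model itself falls back to byte tokens such as <0x41> for unknown pieces.
sp_style = Tokenizer(models.BPE(unk_token="<unk>", byte_fallback=True))
sp_style.decoder = decoders.ByteFallback()
```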
(Disclaimer: this is an explanation of my understanding of the situation after closely observing the development of both sentencepiece and huggingface tokenizers. I am currently not a developer of either.)
Some references: https://github.com/huggingface/tokenizers/issues/1218 https://github.com/huggingface/tokenizers/pull/1217
Very good explanation.
byte_fallback is indeed correctly set to false on all of these.
@xenova, whenever you think something like that is incorrect, please explain why you think the values are incorrect.
Since you didn't explain why you think the current value is wrong, I'll simply say that it is correct, without further explanation for now (sorry, my time is limited).
Thanks for the response 👍
I must have been mistaken about the purpose of byte_fallback, as I thought it also controlled whether to perform the byte-to-unicode trick that keeps control characters out of the BPE.
If possible, could you please explain where in the BPE code it does this mapping? I'm much more familiar with Python, so I will reference the slow tokenization code:
In some situations (e.g., GPT2Tokenizer), the tokenization maps bytes to unicode strings: https://github.com/huggingface/transformers/blob/78941b9fe50c7a1c53d2e6723e9a3b7f5ae2a715/src/transformers/models/gpt2/tokenization_gpt2.py#L298-L306
However, in other situations (e.g., NllbTokenizer), this does not happen: https://github.com/huggingface/transformers/blob/78941b9fe50c7a1c53d2e6723e9a3b7f5ae2a715/src/transformers/models/nllb/tokenization_nllb.py#L334-L335
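For reference, that byte-to-unicode trick looks roughly like this (adapted from the bytes_to_unicode helper that lives in the same tokenization_gpt2.py file; the print call at the end is just my own illustration):

```python
def bytes_to_unicode():
    # Map every possible byte (0-255) to a printable unicode character so that
    # BPE never has to deal with control characters or raw whitespace bytes.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

mapping = bytes_to_unicode()
print("".join(mapping[b] for b in "héllo".encode("utf-8")))  # -> 'hÃ©llo'
```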
I would greatly appreciate if you could point out any part in the tokenizer.json file where I can make this distinction.
Here are the relevant parts of their configs, for reference:
GPT2:
"dropout": null,
"unk_token": null,
"continuing_subword_prefix": "",
"end_of_word_suffix": "",
"fuse_unk": false,
"byte_fallback": false,
Nllb:
"dropout": null,
"unk_token": "<unk>",
"continuing_subword_prefix": null,
"end_of_word_suffix": null,
"fuse_unk": true,
"byte_fallback": false,
There are differences (e.g., continuing_subword_prefix), but I'm not too sure which to use.
The NLLB tokenizer is BPE, but sentencepiece-based. It is useful to understand which framework is being used, and what kind of preprocessing and postprocessing is done. In this codebase in particular, the GPT-like mapping is done in the ByteLevel() pretokenizer: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/pre_tokenizers/byte_level.rs
Some options, such as the continuing subword prefix, were used by the word-level tokenizer. You need to chain decoders to properly decode it, but usually it is not what you are looking for.
One way to approach this is to look at a model that you'd like to emulate and follow their lead.
I would greatly appreciate if you could point out any part in the tokenizer.json file where I can make this distinction.
As it is, it is often ambiguous, unfortunately. It is difficult to address from this codebase alone, since the transformers codebase also deals with it through their AutoTokenizer class. Even if it is fixed in this codebase, there are legacy tokenizers.
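That said, one rough heuristic is to inspect the serialized tokenizer.json directly: the GPT-like byte-to-unicode mapping shows up as a ByteLevel pre_tokenizer, while the sentencepiece-style fallback shows up as byte_fallback on the model. A hypothetical helper (the function name and structure handling are mine, not part of the library):

```python
import json

def describe_byte_handling(path="tokenizer.json"):
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)

    # The pre_tokenizer may be null, a single object, or a Sequence of objects.
    pre = cfg.get("pre_tokenizer") or {}
    if pre.get("type") == "Sequence":
        pre_types = [p.get("type") for p in pre.get("pretokenizers", [])]
    else:
        pre_types = [pre.get("type")]

    return {
        "ByteLevel pretokenizer": "ByteLevel" in pre_types,  # GPT-2 / RoBERTa style
        "byte_fallback in model": bool(cfg.get("model", {}).get("byte_fallback")),  # LLaMA style
    }
```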
Ohhh! Thanks for pointing that out @chris-ha458!
This solved my issue :)
Environment:
Reproduction:
Will output a tokenizer.json file containing "byte_fallback": false.
However, as seen in the slow tokenizer here, it should be true.
byte_fallback was introduced by https://github.com/huggingface/tokenizers/pull/1183 (@Narsil), so it probably has something to do with how the default value is set (which is false). Quite a few models use BPE with byte_fallback set to true by default. I can grab some more examples if needed.
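For anyone wanting to see the value quickly, here is a hedged reproduction sketch (the output directory name is arbitrary, and the byte_fallback field is only serialized by recent tokenizers versions):

```python
import json
from transformers import AutoTokenizer

# Export the fast GPT-2 tokenizer and inspect the serialized tokenizer.json.
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tok.save_pretrained("gpt2-exported")

with open("gpt2-exported/tokenizer.json", encoding="utf-8") as f:
    cfg = json.load(f)

print(cfg["model"]["byte_fallback"])  # prints False for GPT-2
print(cfg["pre_tokenizer"]["type"])   # prints 'ByteLevel'
```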