huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

[bug] BPE `roberta-large-mnli` saved with `.save_pretrained()` incorrectly sets `byte_fallback` to false (should be true) #1234

Closed. xenova closed this issue 1 year ago.

xenova commented 1 year ago

Environment:

Reproduction:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-large-mnli')
tokenizer.save_pretrained('test')

Will output a tokenizer.json file containing:

  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false, // <---- Should be true
  }

However, as seen in the slow tokenizer here, it should be true.

byte_fallback was introduced by https://github.com/huggingface/tokenizers/pull/1183 (@Narsil), so it probably has something to do with how the default value is set (which is false).

Quite a few models use BPE with byte_fallback set to true by default. I can grab some more examples if needed.

xenova commented 1 year ago

As I come across more models which have this problem, I'll update the following list:

chris-ha458 commented 1 year ago

Most if not all RoBERTa-based models and GPT-2-style tokenizers should be considered BPE tokenizers that rely on the ByteLevel() pre-tokenizer. This is different from (unicode) character-level BPE tokenizers that use byte_fallback inside the BPE model itself to handle <unk> cases. Both in theory and in practice, the ByteLevel pre-tokenizer prevents such cases by covering every possible byte value, so there is never an <unk> for the BPE model to fall back from.
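As a quick sanity check of that claim, here is a small sketch using the tokenizers Python bindings (the exact output format may vary between versions): the ByteLevel alphabet covers all 256 byte values, so nothing the BPE model sees can fall outside it.

from tokenizers import pre_tokenizers

# Every possible byte is mapped to a printable unicode character.
print(len(pre_tokenizers.ByteLevel.alphabet()))  # 256

# Any input, including accented or emoji characters, is rewritten into that
# alphabet before the BPE model runs, so no <unk> (and no byte_fallback) is needed.
byte_level = pre_tokenizers.ByteLevel(add_prefix_space=False)
print(byte_level.pre_tokenize_str("héllo 😊"))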

BPE's byte_fallback was implemented to copy the behavior of sentencepiece, where there is no byte-level pre-tokenization stage and the tokenizer instead "falls back" to byte tokens when it would otherwise produce <unk>. LLaMA used sentencepiece with BPE, which required huggingface to implement byte_fallback for BPE even though it had not been necessary before, just as it is not strictly necessary for Unigram. But since it has been implemented, it is now also being exposed for Unigram.
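To illustrate the fallback behavior itself, here is a minimal sketch with a made-up toy vocabulary (assuming a tokenizers version where the BPE model accepts byte_fallback, i.e. after #1183; the expected outputs are my reading of that behavior, not taken from a real model):

from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy vocab: a couple of normal tokens plus the 256 <0xNN> byte tokens,
# mimicking a sentencepiece-style BPE vocabulary such as LLaMA's.
vocab = {"<unk>": 0, "a": 1, "b": 2}
vocab.update({f"<0x{i:02X}>": 3 + i for i in range(256)})

tok = Tokenizer(BPE(vocab=vocab, merges=[], unk_token="<unk>", byte_fallback=True))

print(tok.encode("ab").tokens)  # ['a', 'b']
# 'é' is not in the vocab; with byte_fallback=True it should decompose into its
# UTF-8 bytes instead of becoming <unk>: ['<0xC3>', '<0xA9>']
print(tok.encode("é").tokens)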

(Disclaimer: this is my understanding of the situation after closely observing the development of both sentencepiece and huggingface tokenizers. I am currently not a developer of either.)

Some references https://github.com/huggingface/tokenizers/issues/1218 https://github.com/huggingface/tokenizers/pull/1217

Narsil commented 1 year ago

Very good explanation.

byte_fallback is indeed correctly set to false on all of these. @xenova whenever you think something like that is incorrect, please explain why you think the values are incorrect.

Since you didn't provide an explanation as to why you think it should be true, I'll simply say that the current value is correct without further explanation for now (sorry, my time is limited).

xenova commented 1 year ago

Thanks for the response 👍

I must have been mistaken about the purpose of byte_fallback, as I thought it also controlled whether to perform the byte-to-unicode mapping trick that byte-level BPE uses to avoid control characters.

If possible, could you please explain where in the BPE code it does this mapping? I'm much more familiar with Python, so I will reference the slow tokenization code:
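For context, here is roughly what that mapping looks like in the slow GPT-2 tokenizer (a paraphrased sketch of transformers' bytes_to_unicode, not a verbatim copy):

def bytes_to_unicode():
    # Printable byte values keep their own character; the remaining bytes
    # (control characters, whitespace, etc.) are shifted to unused code points
    # so that every byte has a visible, non-control representation.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))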


I would greatly appreciate it if you could point out any part of the tokenizer.json file where I can make this distinction.

Here are the relevant parts of their configs for reference:

GPT2:

    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false,

Nllb:

    "dropout": null,
    "unk_token": "<unk>",
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": true,
    "byte_fallback": false,

There are differences (e.g., continuing_subword_prefix), but I'm not sure which of these fields to use to tell them apart.

chris-ha458 commented 1 year ago

The NLLB tokenizer is BPE but sentencepiece-based. It is useful to understand which framework is being used and what kind of pre- and post-processing is done. In this codebase in particular, the GPT-2-like mapping is done in the ByteLevel() pre-tokenizer: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/pre_tokenizers/byte_level.rs
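One practical way to see where the byte handling lives is to inspect the saved tokenizer.json directly (a sketch; describe_byte_handling is a made-up helper, and the field names follow the tokenizer.json layout shown earlier in this thread):

import json
from transformers import AutoTokenizer

def describe_byte_handling(tokenizer_json_path):
    # Distinguish ByteLevel pre-tokenization (GPT-2/RoBERTa style) from the
    # BPE model's byte_fallback flag (sentencepiece/LLaMA style).
    with open(tokenizer_json_path, encoding="utf-8") as f:
        cfg = json.load(f)
    pre_tokenizer = cfg.get("pre_tokenizer") or {}
    model = cfg.get("model") or {}
    # ByteLevel may be nested inside a Sequence pre-tokenizer, so search the JSON text.
    print("ByteLevel pre-tokenizer:", '"ByteLevel"' in json.dumps(pre_tokenizer))
    print("model.byte_fallback:", model.get("byte_fallback"))

AutoTokenizer.from_pretrained("roberta-large-mnli").save_pretrained("test")
describe_byte_handling("test/tokenizer.json")
# Expected: ByteLevel pre-tokenizer True, byte_fallback False (matching the discussion above).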

chris-ha458 commented 1 year ago

Some options, such as the continuing word suffix, were used by the word-level tokenizer. You need to chain decoders to properly decode it, but usually that is not what you are looking for.

One way to approach this is to look at a model that you'd like to emulate and follow their lead.

chris-ha458 commented 1 year ago

I would greatly appreciate it if you could point out any part of the tokenizer.json file where I can make this distinction.

As it is, it is often ambiguous, unfortunately. It is difficult to address from this codebase alone, since the transformers codebase also deals with it through its AutoTokenizer class. Even if it were fixed in this codebase, there would still be legacy tokenizers out there.

xenova commented 1 year ago

Ohhh! Thanks for pointing that out @chris-ha458!

This solved my issue :)