huggingface / tokenizers

šŸ’„ Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

BertWordPieceTokenizer tokenizes invalid characters differently from original BERT implementation #576

Closed. ethch18 closed this issue 3 years ago.

ethch18 commented 3 years ago

Hi! I've run into an issue tokenizing some noisy text from the web using the BertWordPieceTokenizer. Specifically, unrecognizable characters are tokenized as [UNK] by the original BERT WordPiece tokenizer, but they are silently dropped by the tokenizers implementation. Here's an MWE (Python):

import tokenization as bert_tokenization # from original tf BERT repo
from tokenizers import BertWordPieceTokenizer
vocab_file = "/path/to/mbert/vocab.txt"

base_tokenizer = bert_tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=False)
rust_tokenizer = BertWordPieceTokenizer(
    vocab_file,
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=False,
)

line = "ļ™ Il-qassisin- L-indu jemmnu li hargu minn halq Brahman."
base_wordpieces = base_tokenizer.tokenize(line)
rust_wordpieces = rust_tokenizer.encode(line, add_special_tokens=False).tokens

print(base_wordpieces == rust_wordpieces)
print()
print(base_wordpieces)
print()
print(rust_wordpieces)

(where the unrecognizable character is encoded as b"\xef\x81\x99"). Output:

False

['[UNK]', 'Il', '-', 'q', '##ass', '##isin', '-', 'L', '-', 'ind', '##u', 'je', '##mm', '##nu', 'li', 'har', '##gu', 'min', '##n', 'hal', '##q', 'Br', '##ah', '##ma', 'n']

['Il', '-', 'q', '##ass', '##isin', '-', 'L', '-', 'ind', '##u', 'je', '##mm', '##nu', 'li', 'har', '##gu', 'min', '##n', 'hal', '##q', 'Br', '##ah', '##ma', 'n']

(specifically, note that everything is the same except for the UNK corresponding to the unrecognizable character at the start)

This is mainly a problem when such characters come at the end of some other sequence, where omitting them versus keeping them makes the difference between a viable tokenization and an [UNK]. Either way, it means the tokenizers implementation yields different results than the original implementation in some cases.

For reference, here's a list of other characters that I've seen this happen on; seems like this occurs in the higher ranges of the Unicode table:

\xee\x80\x84
\xee\x80\x83
\xee\x80\x94
\xee\x80\x93
\xee\x80\x81
\xee\x80\xdb
\xef\x82\xa7
\xee\x83\xbd
\xef\x82\xb0
\xef\x82\xb4

Thanks!


Tokenizers version: 0.9.4
Python: 3.6
BERT version: master (the tokenization file hasn't been touched in over a year, since mBERT was released)

Narsil commented 3 years ago

Which vocab.txt are you using? It seems the [UNK] token is missing from your file. I used this one: https://huggingface.co/bert-base-uncased/raw/main/vocab.txt and everything seems to be working normally (I don't get your results, which is why I suppose you're not using the same file).

Also, which version of tokenizers are you using? I'm using 0.9.4.

Happy new year!

ethch18 commented 3 years ago

I'm using the multilingual-base-cased vocabulary from Google. I don't think the [UNK] token is missing from this file: see the output above -- [UNK] is present in the Google tokenizer's output but not in the huggingface one, using the same vocab. I'm using 0.9.4; other environment details are in the footer of my original comment.

n1t0 commented 3 years ago

This difference has been introduced with this commit: https://github.com/google-research/bert/commit/d66a146741588fb208450bde15aa7db143baaa69

It changed the cleaning step of the original BERT tokenizer so that it no longer removes some characters, namely those in the following Unicode categories:

Cs (Surrogate): a surrogate code point
Co (Private_Use): a private-use character
Cn (Unassigned): a reserved unassigned code point or a noncharacter

(source: https://unicode.org/reports/tr44/#General_Category_Values)
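
For illustration, here is a quick way to check which of these categories the reported characters fall into, using only Python's standard unicodedata module (the byte strings are taken from the list in the original report; nothing else is assumed):

import unicodedata

# The reported characters all decode to code points in the Private Use Area,
# i.e. Unicode general category Co, one of the categories listed above.
for raw in (b"\xef\x81\x99", b"\xee\x80\x84", b"\xef\x82\xa7"):
    ch = raw.decode("utf-8")
    print(hex(ord(ch)), unicodedata.category(ch))

# Prints something like:
# 0xf059 Co
# 0xe004 Co
# 0xf0a7 Co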

The commit message says it is to allow emojis, but I don't really understand why, since emojis are not part of these categories. If you try to encode some emojis with our version of the tokenizer, you'll see that they get replaced by [UNK] as expected (if they're not part of the vocabulary, of course).

So I'd say everything works as expected on this side, and it is very unlikely that we will update this unless we have a good reason to do so. If someone has more information, feel free to jump in / re-open the issue!
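
To make the behavioural difference concrete, here is a rough sketch of the two cleaning policies as described above. This is a paraphrase for illustration only, not the actual BERT or tokenizers source, and the function names are made up:

import unicodedata

def removed_by_old_cleaner(ch):
    # Pre-commit behaviour (what tokenizers replicates, per the discussion
    # above): any character whose general category starts with "C"
    # (Cc, Cf, Cs, Co, Cn) is dropped during text cleaning, with
    # tab/newline/carriage return kept as whitespace.
    if ch in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(ch).startswith("C")

def removed_by_new_cleaner(ch):
    # Post-commit behaviour (current BERT master): only control/format
    # characters are dropped, so Cs/Co/Cn characters survive cleaning and
    # typically end up as [UNK] at the wordpiece stage.
    if ch in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(ch) in ("Cc", "Cf")

print(removed_by_old_cleaner("\uf059"))  # True  -> stripped, as tokenizers does
print(removed_by_new_cleaner("\uf059"))  # False -> kept, becomes [UNK] on BERT master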

ethch18 commented 3 years ago

Hi @n1t0 -- thanks for your response! I'm not sure this is really the expected behavior, as it leads to some more significant differences between the tokenizers and Google implementations wherever these characters are present.

For instance, consider the byte string b"10\xe2\x84\x83\xee\x82\xae", which is 10ā„ƒ (a degree Celsius glyph, which is in the vocabulary) followed by one of the removed characters. This string tokenizes two different ways:

Google: ['[UNK]']
tokenizers: ['10', '##\xe2\x84\x83']

The current tokenizers implementation is arguably more "desirable" or "correct" for a general wordpiece tokenizer, but either way, the two implementations tokenize this string differently, which may lead to further divergences downstream. Since this is the BertWordPieceTokenizer, which is supposed to replicate BERT's behavior, it seems worth looking into in order to ensure equal results between the two implementations.

n1t0 commented 3 years ago

I agree that our implementation should replicate BERT's behavior, and I believe it does for most of the models it supports. As you can see from the commit I linked, this modification of the original tokenizer landed after most of the models (except the WWM ones) were released. I believe this means that all these tokenizers and their released vocabs rely on the first version of the tokenizer (the one used by tokenizers and transformers), so they should keep using this version.

That being said, I'm entirely open to the discussion, and we could add an option to switch between the two methods if it's relevant. Would you be willing to run some evaluations on the various models, using both methods to see if there's any significant difference?
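
For anyone who wants to run such an evaluation, a minimal comparison script along the lines of the MWE above might look like this. The vocab and corpus paths are placeholders, and the tokenize/encode calls are the ones already used in this thread:

import tokenization as bert_tokenization  # from the original tf BERT repo
from tokenizers import BertWordPieceTokenizer

vocab_file = "/path/to/mbert/vocab.txt"  # placeholder path
corpus_file = "/path/to/corpus.txt"      # placeholder path

base_tokenizer = bert_tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=False)
rust_tokenizer = BertWordPieceTokenizer(
    vocab_file,
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=False,
)

# Count lines where the two tokenizers disagree.
total = mismatched = 0
with open(corpus_file, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        total += 1
        base_wp = base_tokenizer.tokenize(line)
        rust_wp = rust_tokenizer.encode(line, add_special_tokens=False).tokens
        if base_wp != rust_wp:
            mismatched += 1

print("mismatched lines: %d / %d" % (mismatched, total))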

ethch18 commented 3 years ago

Thanks for the clarification -- that makes sense. I missed the date stamp on the commit you linked, and I agree with your reasoning.

My use case is that (for various reasons) I'm training new BERT models with the Google code and porting them over to transformers, which means that I'm also following their updated tokenization scheme. But since the number of impacted sentences/lines is so small, I don't think there will be an issue on my side either.