google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Tokenization behavior with messed-up unicode characters #1093

Open sotlampr opened 4 years ago

sotlampr commented 4 years ago

Hi,

I've been porting the tokenizer to C++ with ICU for Unicode normalization and noticed that in some cases the FullTokenizer class fails:

poc.sh:

#!/bin/bash
# Fetch and unpack the uncased BERT-Base checkpoint on first run.
if [ ! -d "uncased_L-12_H-768_A-12" ]; then
  wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip && \
  unzip uncased_L-12_H-768_A-12.zip
fi
# Tokenize three inputs; the parenthesized line printed after each result
# is what my C++/ICU implementation produces, for comparison.
python -c '
from bert.tokenization import FullTokenizer
tok = FullTokenizer("uncased_L-12_H-768_A-12/vocab.txt", True)
print("to send me your vote♡♡")
print(tok.tokenize("to send me your vote♡♡"))
print("(to send me your vote [UNK])")
print()
print("Good morning Rushers!♥♡")
print(tok.tokenize("Good morning Rushers!♥♡"))
print("(good morning rush ##ers ! ♥ [UNK])")
print()
print("almost to the finish line♡")
print(tok.tokenize("almost to the finish line♡"))
print("(almost to the finish line [UNK])")
print()
'

Outputs:

to send me your vote♡♡
['to', 'send', 'me', 'your', '[UNK]']
(to send me your vote [UNK])  # <-- My implementation

Good morning Rushers!♥♡
['good', 'morning', 'rush', '##ers', '!', '[UNK]']
(good morning rush ##ers ! ♥ [UNK])  # <-- My implementation

almost to the finish line♡
['almost', 'to', 'the', 'finish', '[UNK]']
(almost to the finish line [UNK])  # <-- My implementation

It seems to happen only in edge cases with really messed-up codepoints.
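
If it helps, my reading of why this happens (a sketch, not confirmed by the maintainers): ♡ (U+2661) and ♥ (U+2665) are Unicode symbols (category So), not punctuation (P*), so BasicTokenizer never detaches them from the preceding word, and WordPiece then maps the whole unmatched chunk to a single [UNK]. A quick check, with the predicate condensed from _is_punctuation in bert/tokenization.py:

import unicodedata

# Both hearts are "Symbol, other" (So), not punctuation (P*):
for ch in "\u2661\u2665":  # ♡, ♥
    print(hex(ord(ch)), unicodedata.category(ch))  # 0x2661 So / 0x2665 So

# Condensed from _is_punctuation in bert/tokenization.py: ASCII
# non-alphanumerics are always treated as punctuation; everything else
# falls back to the Unicode P* categories, so So symbols are not split.
def _is_punctuation(char):
    cp = ord(char)
    if ((33 <= cp <= 47) or (58 <= cp <= 64) or
            (91 <= cp <= 96) or (123 <= cp <= 126)):
        return True
    return unicodedata.category(char).startswith("P")

print(_is_punctuation("!"))       # True  -> "!" is split off
print(_is_punctuation("\u2661"))  # False -> "vote♡♡" stays one token -> [UNK]

That would also explain the second example: "!" is split off as punctuation, but "♥♡" stays glued together and becomes one [UNK], whereas my implementation splits the symbols apart and ♥ (which is apparently in the vocab on its own) survives.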

OS: Ubuntu 18.04.4 LTS, Python: 3.6.9

sotlampr commented 4 years ago

I don't know if this is the desired behavior, but you might want to take a look at this: https://github.com/sotlampr/bert/commit/30cf03116cf2ed909c92da2a5be7e8a924778de7
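
For anyone skimming, the gist is roughly the sketch below (illustrative only; the predicate name and exact categories here are mine, see the commit for the real diff): split on Unicode symbol characters (S*) as well as punctuation, so a stray symbol becomes its own token instead of turning the whole word into [UNK].

import unicodedata

# Illustrative only -- not the literal diff. The idea is to extend the
# basic tokenizer's split predicate so that Unicode symbols
# (Sm/Sc/Sk/So) are detached from words, just like punctuation.
def _should_split(char):
    cp = ord(char)
    if ((33 <= cp <= 47) or (58 <= cp <= 64) or
            (91 <= cp <= 96) or (123 <= cp <= 126)):
        return True
    cat = unicodedata.category(char)
    return cat.startswith("P") or cat.startswith("S")

# "vote♡♡" would then split into ["vote", "♡", "♡"] before WordPiece,
# so the word itself survives and only the unknown symbols map to [UNK].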