I've been porting the tokenizer to C++ with ICU for Unicode normalization and noticed that in some cases the FullTokenizer class fails:
poc.sh:
#!/bin/bash
if [ ! -d "uncased_L-12_H-768_A-12" ]; then
  wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip && \
  unzip uncased_L-12_H-768_A-12.zip
fi
python -c '
from bert.tokenization import FullTokenizer
tok = FullTokenizer("uncased_L-12_H-768_A-12/vocab.txt", True)
print("to send me your vote♡♡")
print(tok.tokenize("to send me your vote♡♡"))
print("(to send me your vote [UNK])")
print()
print("Good morning Rushers!♥♡")
print(tok.tokenize("Good morning Rushers!♥♡"))
print("(good morning rush ##ers ! ♥ [UNK])")
print()
print("almost to the finish line♡")
print(tok.tokenize("almost to the finish line♡"))
print("(almost to the finish line [UNK])")
print()
'
Outputs:
to send me your vote♡♡
['to', 'send', 'me', 'your', '[UNK]']
(to send me your vote [UNK]) # <-- My implementation

Good morning Rushers!♥♡
['good', 'morning', 'rush', '##ers', '!', '[UNK]']
(good morning rush ##ers ! ♥ [UNK]) # <-- My implementation

almost to the finish line♡
['almost', 'to', 'the', 'finish', '[UNK]']
(almost to the finish line [UNK]) # <-- My implementation
It seems to happen only on edge cases involving unusual codepoints.
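For what it's worth, a minimal sketch of what I suspect is going on (assumption on my part: the heart characters are the trigger). Both hearts have Unicode general category "So" (Symbol, other), not "P*" (punctuation), so a basic tokenizer that only splits on punctuation categories keeps them attached to the preceding word, and the whole word (e.g. "vote♡♡") then falls out of the vocab and maps to a single [UNK]:

```python
import unicodedata

# ♡ U+2661 WHITE HEART SUIT, ♥ U+2665 BLACK HEART SUIT, and '!' for contrast.
for ch in ["\u2661", "\u2665", "!"]:
    cat = unicodedata.category(ch)
    # "So" = Symbol, other; "Po" = Punctuation, other. A tokenizer that only
    # splits on "P*" categories (plus ASCII punctuation ranges) will not split
    # on the hearts, so "vote♡♡" stays one wordpiece token -> [UNK].
    print(f"U+{ord(ch):04X}: category {cat}")
# -> U+2661: category So
# -> U+2665: category So
# -> U+0021: category Po
```

If ICU's categorization or normalization treats these symbols differently from Python's unicodedata, that would explain why my port splits them off while the reference implementation does not.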
OS: Ubuntu 18.04.4 LTS
Python: 3.6.9