google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

Discrepancy in tokenization results using albert's tokenizer and sentencepiece library #249

Open anjali-chadha opened 2 years ago

anjali-chadha commented 2 years ago

Hi -

I recently noticed that ALBERT's tokenizer implementation and the sentencepiece library produce different tokens for some inputs, even when both use the same SentencePiece model. See the example below:

SentencePiece Implementation

!pip install sentencepiece

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('<SPM_MODEL>')
print(sp.encode_as_pieces('3.0,'))
print(sp.encode_as_ids('3.0,'))

Output:
['▁3.0,']
[72369]

Using Albert

pip install sentencepiece
git clone https://github.com/google-research/albert.git

>>> import tokenization
>>> spm_tokenizer = tokenization.FullTokenizer(vocab_file=<VOCAB_FILE>, spm_model_file=<SPM_MODEL_FILE>) 
>>> spm_tokenizer.convert_tokens_to_ids(spm_tokenizer.tokenize("3.0,"))

Output:
[16047, 254713]

After looking at ALBERT's tokenizer implementation, I believe the if condition on the following line is what causes the difference between the two outputs above: https://github.com/google-research/albert/blob/master/tokenization.py#L67
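To make the question concrete, here is a minimal sketch of the kind of post-processing that would explain the two-token output. It assumes the linked condition re-tokenizes any piece that ends in a digit followed by a comma and emits the comma as a separate piece. The names `split_trailing_comma` and `toy_encode` are hypothetical and not part of ALBERT or sentencepiece; `toy_encode` only stands in for `sp.encode_as_pieces` so the snippet runs without a model file.

```python
SPIECE_UNDERLINE = "▁"  # SentencePiece word-boundary marker


def split_trailing_comma(pieces, encode_as_pieces):
    """Split pieces ending in '<digit>,' into re-encoded text plus ','.

    `encode_as_pieces` is any callable that maps a string to a list of
    pieces, e.g. the method of a loaded SentencePieceProcessor.
    """
    new_pieces = []
    for piece in pieces:
        if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
            # Re-encode the piece without the comma and the boundary marker.
            text = piece[:-1].replace(SPIECE_UNDERLINE, "")
            cur = list(encode_as_pieces(text))
            # Re-encoding may add a boundary marker the original piece
            # did not have; strip it so the pieces still join correctly.
            if cur and piece[0] != SPIECE_UNDERLINE and cur[0].startswith(SPIECE_UNDERLINE):
                if len(cur[0]) == 1:
                    cur = cur[1:]
                else:
                    cur[0] = cur[0][1:]
            cur.append(piece[-1])  # the comma becomes its own piece
            new_pieces.extend(cur)
        else:
            new_pieces.append(piece)
    return new_pieces


def toy_encode(text):
    # Hypothetical stand-in for sp.encode_as_pieces: one piece per input.
    return [SPIECE_UNDERLINE + text]


print(split_trailing_comma(["▁3.0,"], toy_encode))
```

With this stand-in encoder, the single piece '▁3.0,' becomes two pieces, which would match the two ids returned by the ALBERT tokenizer versus the one id returned by sentencepiece directly.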

Could you explain the intuition behind these additional steps in ALBERT's tokenizer and what purpose they serve?

Thanks!