huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Possible bug in spm-based tokenizers #12867

Closed Mehrad0711 closed 2 years ago

Mehrad0711 commented 3 years ago

Environment info

Who can help

@patrickvonplaten, @patil-suraj

Information

Model I am using (Bert, XLNet ...): mbart-large-50-many-to-many-mmt

To reproduce

Running the following script shows that encoding and then decoding a Chinese string does not give back the original string (the punctuation marks get normalized):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/mbart-large-50-many-to-many-mmt', src_lang='zh_CN', tgt_lang='zh_CN')

sentence = '您好,您打算到哪里去呢?'
input = tokenizer(sentence)
output = tokenizer.decode(input['input_ids'], skip_special_tokens=True)

print(output)
print(output == sentence)

stdout:

您好,您打算到哪里去呢?
False

Using the slow version of the tokenizer, or setting the src_lang and tgt_lang attributes directly, gives the same result; a quick check with the slow tokenizer is sketched below.
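
For completeness, a minimal sketch of the slow-tokenizer check. It assumes use_fast=False selects the SentencePiece-based slow tokenizer; this is an illustration, not part of the original report:

from transformers import AutoTokenizer

# Load the slow (SentencePiece-backed) tokenizer for the same checkpoint.
slow_tokenizer = AutoTokenizer.from_pretrained(
    'facebook/mbart-large-50-many-to-many-mmt',
    src_lang='zh_CN', tgt_lang='zh_CN', use_fast=False,
)

sentence = '您好,您打算到哪里去呢?'  # same sentence as above
decoded = slow_tokenizer.decode(slow_tokenizer(sentence)['input_ids'], skip_special_tokens=True)
print(decoded == sentence)  # also False: the punctuation is normalized here as well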

Expected behavior

Expected stdout:

您好,您打算到哪里去呢?
True

Mehrad0711 commented 3 years ago

In fact, this seems to be a problem with other spm-based tokenizers too. Other MBART checkpoints, as well as MT5 and XLM-R models, show the same behavior, but multilingual BERT checkpoints do not (see the round-trip sketch below). I'm not sure whether this issue has been reported or discussed before; any hints are appreciated.
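
As an illustration (the checkpoint names below are the public Hub identifiers I'm assuming for each model family), one can compare the round-trip output across tokenizers and inspect whether the punctuation survives:

from transformers import AutoTokenizer

sentence = '您好,您打算到哪里去呢?'
checkpoints = [
    'facebook/mbart-large-cc25',     # another MBART checkpoint
    'google/mt5-small',              # MT5
    'xlm-roberta-base',              # XLM-R
    'bert-base-multilingual-cased',  # multilingual BERT
]

for name in checkpoints:
    tok = AutoTokenizer.from_pretrained(name)
    decoded = tok.decode(tok(sentence)['input_ids'], skip_special_tokens=True)
    # Print the decoded string so the punctuation handling is visible for each model.
    print(name, decoded)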

patrickvonplaten commented 3 years ago

@patil-suraj - could you take a look here for MBart "many-to-many"?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Mehrad0711 commented 3 years ago

Hi, was wondering if there are any updates?

patil-suraj commented 3 years ago

Hi @Mehrad0711, sorry for only replying now.

I will try to allocate some time this week for it.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patrickvonplaten commented 2 years ago

@patil-suraj - ping again here :-)

patil-suraj commented 2 years ago

@Mehrad0711 @patrickvonplaten Sorry about being super slow here.

I'm not sure this is really a bug; it looks like the punctuation is normalized by the spm model itself. You can load the original spm model from mBART and see that it normalizes the string during tokenization.

To verify, download the official spm model from https://github.com/pytorch/fairseq/tree/main/examples/mbart and run:

import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("mbart.cc25.v2/sentence.bpe.model")

sentence = '您好, 您打算到哪里去呢?'
tokenized = sp_model.encode_as_pieces(sentence)
# => ['▁您', '好', ',', '您', '打算', '到', '哪里', '去', '呢', '?']

decoded = sp_model.decode_pieces(tokenized)
# => '您好,您打算到哪里去呢?'
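
A plausible explanation (my assumption, not stated in this thread) is that the spm model's built-in normalizer applies an NFKC-style rule, which maps fullwidth punctuation such as the fullwidth comma and question mark to their ASCII counterparts. A standard-library check of that mapping:

import unicodedata

sentence = '您好,您打算到哪里去呢?'
# If NFKC turns the fullwidth comma/question mark into ',' and '?', that would
# match the normalization observed in the decoded output above.
print(unicodedata.normalize('NFKC', sentence))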

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.