Closed: Mehrad0711 closed this issue 2 years ago
In fact, this seems to be a problem with other spm-based tokenizers too: other MBART checkpoints, as well as MT5 and XLM-R models, show the same behavior, but multilingual BERT checkpoints do not. I am not sure if this issue has been reported or discussed before; any hints are appreciated.
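To illustrate, a minimal round-trip check (just a sketch; the checkpoint names below are examples I picked, not from the original report):

from transformers import AutoTokenizer

sentence = '您好, 您打算到哪里去呢?'

# The spm-based checkpoints normalize the full-width punctuation; mBERT does not
# (it only re-spaces the characters during decoding).
for name in [
    "facebook/mbart-large-cc25",
    "google/mt5-small",
    "xlm-roberta-base",
    "bert-base-multilingual-cased",
]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    roundtrip = tokenizer.decode(tokenizer(sentence)["input_ids"], skip_special_tokens=True)
    print(f"{name}: {roundtrip}")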
@patil-suraj - could you take a look here for MBart "many-to-many"?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, I was wondering if there are any updates?
Hi @Mehrad0711, sorry to only reply now.
I will try to allocate some time this week for it.
@patil-suraj - ping again here :-)
@Mehrad0711 @patrickvonplaten Sorry about being super slow here.
I'm not sure this is really a bug; it looks like the punctuation is normalized by the spm model itself. You can load the original spm model from MBART and see that it normalizes the string during tokenization.
To verify, download the official spm model from https://github.com/pytorch/fairseq/tree/main/examples/mbart and run:
import sentencepiece as spm

# Load the official MBART sentencepiece model downloaded above.
sp_model = spm.SentencePieceProcessor()
sp_model.Load("mbart.cc25.v2/sentence.bpe.model")

sentence = '您好, 您打算到哪里去呢?'

# The punctuation is already normalized at the piece level.
tokenized = sp_model.encode_as_pieces(sentence)
# => ['▁您', '好', ',', '您', '打算', '到', '哪里', '去', '呢', '?']

decoded = sp_model.decode_pieces(tokenized)
# => '您好,您打算到哪里去呢?'
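The normalization rules are embedded in the spm model file itself. Assuming a sentencepiece version that ships the model protobuf bindings, the normalizer spec can be inspected directly (a sketch):

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Parse the raw model file and look at its embedded normalizer spec.
proto = sp_pb2.ModelProto()
with open("mbart.cc25.v2/sentence.bpe.model", "rb") as f:
    proto.ParseFromString(f.read())

print(proto.normalizer_spec.name)                       # normalization rule name (likely an NFKC variant)
print(len(proto.normalizer_spec.precompiled_charsmap))  # non-zero => a normalization map is baked in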
Environment info
transformers version: latest (4.10.0.dev0)

Who can help
@patrickvonplaten, @patil-suraj

Information
Model I am using (Bert, XLNet ...): mbart-large-50-many-to-many-mmt
To reproduce
Running the following script shows that encoding and then decoding a Chinese string does not give back the same string (the punctuation marks are normalized):
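The script itself is not preserved in this report; a minimal sketch that reproduces the described behavior (assuming the facebook/mbart-large-50-many-to-many-mmt checkpoint from the Hub):

from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="zh_CN", tgt_lang="zh_CN"
)

sentence = '您好, 您打算到哪里去呢?'
decoded = tokenizer.decode(tokenizer(sentence)["input_ids"], skip_special_tokens=True)

print(decoded)              # punctuation comes back normalized
print(decoded == sentence)  # False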
stdout:
您好,您打算到哪里去呢?
Using the slow version of the tokenizer, or setting the src_lang and tgt_lang attributes directly, gives the same result.
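For completeness, a sketch of those two variants:

from transformers import MBart50Tokenizer

# Slow (sentencepiece-backed) tokenizer instead of the fast one.
slow_tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
# Setting the language attributes directly instead of passing them to from_pretrained.
slow_tokenizer.src_lang = "zh_CN"
slow_tokenizer.tgt_lang = "zh_CN"

sentence = '您好, 您打算到哪里去呢?'
print(slow_tokenizer.decode(slow_tokenizer(sentence)["input_ids"], skip_special_tokens=True))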
Expected behavior
Expected stdout (the original string, unchanged):
您好, 您打算到哪里去呢?