huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Weird behavior with mBART-50 and Spanish #12958

Open ArbinTimilsina opened 3 years ago

ArbinTimilsina commented 3 years ago

Environment info

Who can help

@patrickvonplaten

Information

I am seeing weird behavior with mBART-50 and Spanish. Please look at the code below:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

text = "http://www.ted.com/talks/stephen_palumbi_following_the_mercury_trail.html"

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
tokenizer.src_lang = "es_XX"

encoded = tokenizer(text, return_tensors="pt")
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

The output is:

['(b) To continue to cooperate closely with the Special Rapporteur on extrajudicial, summary or arbitrary executions, the Special Rapporteur on torture and other cruel, inhuman or degrading treatment or punishment, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on torture and other cruel, inhuman or degrading treatment or punishment, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on violence against women, its causes and consequences, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special']
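Output like this can be flagged automatically, because it loops over the same phrases. The following is only a rough sketch of one such check: `repeated_ngram_fraction` is a hypothetical helper, not part of transformers, and the n-gram size of 4 is an arbitrary assumption.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Return the fraction of word n-grams that duplicate an earlier n-gram.

    Hypothetical diagnostic helper (not a transformers API). Values near 0
    mean little repetition; values well above 0 suggest looping output.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Text shorter than n words cannot contain a repeated n-gram.
        return 0.0
    counts = Counter(ngrams)
    # The first occurrence of each n-gram is new; every further one is a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)
```

Applied to each string returned by `tokenizer.batch_decode(...)`, a high value would mark outputs such as the one above, while a short copied-through URL would score 0.0.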

However, if I change the source language to French (tokenizer.src_lang = "fr_XX") or to any other language, I get the following output, which is what I would expect:

['http://www.ted.com/talks/stephen_palumbi_following_the_mercury_trail.html']

The same thing happens with other texts as well (e.g., "888"). Do you know why this behavior is unique to Spanish? Also, do you have any idea how to correct it?
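To check which source languages are affected, the comparison can be run over a list of language codes. This is a minimal sketch under assumptions: `find_non_copying_langs` is a hypothetical helper, and `translate` stands for any callable that sets `tokenizer.src_lang`, then runs the same tokenizer, `model.generate`, and `batch_decode` calls as in the snippet above and returns the decoded string.

```python
from typing import Callable, Dict, Iterable


def find_non_copying_langs(
    translate: Callable[[str, str], str],
    text: str,
    src_langs: Iterable[str],
) -> Dict[str, str]:
    """Return {src_lang: output} for languages whose output differs from the input.

    Hypothetical helper: `translate(text, src_lang)` is assumed to wrap the
    tokenizer/model calls from the reproduction snippet. It is only meaningful
    for inputs that should pass through unchanged, such as a URL or "888".
    """
    failures = {}
    for lang in src_langs:
        output = translate(text, lang)
        # Ignore leading/trailing whitespace differences from decoding.
        if output.strip() != text.strip():
            failures[lang] = output
    return failures
```

If the behavior really is specific to Spanish, calling this with codes such as "es_XX", "fr_XX", and "de_DE" should return a dictionary containing only "es_XX".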

Thanks!

LysandreJik commented 3 years ago

Pinging @patil-suraj too, and @mrm8488 might have played with that model in the past.

ianbstewart commented 2 years ago

Any progress here? I've hit the exact same problem when translating from Spanish, although with slightly different output:

The Committee recommends that the State party take all necessary measures to ensure that the right to adequate housing is guaranteed in the State party's next periodic report, and that the State party take all necessary measures to ensure that the right to adequate housing is guaranteed in its next periodic report.

patrickvonplaten commented 2 years ago

@patil-suraj - could you take a look here?

nehasrikn commented 9 months ago

+1, I've been having the same issue translating from Spanish to English. Could someone take a look?