huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Empty sentence and minus translation in opus-mt-de-en model #14337

Closed kateryna-bud closed 2 years ago

kateryna-bud commented 2 years ago

Environment info

Who can help

@patrickvonplaten

Information

Model I am using: MarianMT (Helsinki-NLP/opus-mt-de-en)

To reproduce

Steps to reproduce the behavior:

from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-de-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Inputs with no translatable content: a single space and a bare minus sign
src_text = [" ", "-"]
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Output: ["I don't know.", '- No, no, no, no, no, no, no.']

Expected behavior

Return an empty sentence or the character itself, since there is nothing to translate in [" ", "-"].

Thanks in advance!

Cheers, Kateryna

patrickvonplaten commented 2 years ago

Hey @kateryna-bud,

Marian models were not really trained on inputs such as [" ", "-"], so this data can be considered strongly out-of-distribution, and the outputs will be unpredictable.

Why would you need to translate a single empty space? :-)

kateryna-bud commented 2 years ago

Hi @patrickvonplaten,

thanks for your answer. My pipeline has other NLP preprocessing steps, and in some cases they produce empty sentences. I filter those out now, but I still wonder why the model returns "I don't know." I thought this might be a default output for unpredictable inputs; in that case I would like to adjust it.
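A minimal sketch of such a pre-filtering step (the helper name `is_translatable` is my own, not a transformers API):

```python
def is_translatable(sentence: str) -> bool:
    """Return False for inputs with no translatable text,
    e.g. empty strings, whitespace, or bare punctuation."""
    return any(ch.isalnum() for ch in sentence)

src_text = [" ", "-", "Guten Morgen!"]

# Only send real sentences to the model; pass trivial inputs through unchanged.
to_translate = [s for s in src_text if is_translatable(s)]
passthrough = [s for s in src_text if not is_translatable(s)]
```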

What about the "-"? For other characters, the prediction is the same as the input, but not for the minus.

Thanks, Kateryna

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

kateryna-bud commented 2 years ago

Hi @patrickvonplaten

The translation model for the input word 'hec' returns ['Hey, hey, hey, hey, hey, ...'] (the token repeated until the generation length limit).

How should I handle these hiccups?
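One pragmatic way to catch such degenerate outputs is a post-hoc repetition check before accepting a translation. A heuristic sketch (not part of transformers; the threshold is illustrative):

```python
def looks_degenerate(text: str, max_repeat_ratio: float = 0.5) -> bool:
    """Heuristic: flag outputs where a single word dominates the sentence."""
    tokens = [t.strip(".,!?-").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if len(tokens) < 4:
        return False  # too short to judge
    most_common = max(tokens.count(t) for t in set(tokens))
    return most_common / len(tokens) > max_repeat_ratio

looks_degenerate("Hey, hey, hey, hey, hey, hey, hey, hey.")   # True
looks_degenerate("The quick brown fox jumps over the dog.")   # False
```

Flagged outputs could then be dropped, retried with different decoding settings, or replaced by the source text.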

patrickvonplaten commented 2 years ago

Hey @kateryna-bud - I don't think "hec" is a valid word, and I also don't know what you would expect the translation to be here. In general, translation models are by no means perfect and can show unexpected behavior. You could try applying some of the generation methods described here: https://huggingface.co/blog/how-to-generate
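A hedged sketch of decoding settings from that blog post which often suppress repetition loops; the keyword names are real `generate()` arguments, but the values are illustrative, not tuned:

```python
# Decoding settings that typically reduce degenerate repetition.
gen_kwargs = {
    "num_beams": 4,              # beam search instead of greedy decoding
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram
    "repetition_penalty": 1.2,   # penalize re-emitting recent tokens
    "max_length": 60,            # cap runaway generations
}

# Usage with the reproduction snippet above (assumes `model` and `tokenizer`):
# batch = tokenizer(src_text, return_tensors="pt", padding=True)
# translated = model.generate(**batch, **gen_kwargs)
```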

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

hdeval1 commented 1 year ago

Did anyone find a resolution to this?