huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Strange behavior of translation (text generation) pipelines #19396

Closed Fikavec closed 2 years ago

Fikavec commented 2 years ago

Related example: [QUESTION] model translates only a part of the text

Reproduction

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

translator = pipeline(
    "translation", model=model, tokenizer=tokenizer,
    src_lang="ces_Latn", tgt_lang="eng_Latn", device=0,
)

# Text with 3 sentences:
# 1) Zuzka bydlí v paneláku na 9 podlaží. ("Zuzka lives on the 9th floor of a block of flats.")
# 2) Anička bydlí o 3 podlaží výše. ("Anička lives 3 floors higher.")
# 3) Na kterém podlaží bydlí Anička? ("On which floor does Anička live?")
text = "Zuzka bydlí v paneláku na 9 podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?"
translator(text, max_length=512, num_beams=5)

The output contains only one sentence (the rest of the text is lost):

[{'translation_text': 'Zuzka lives in a nine-story penthouse, Anička lives three floors up.'}]

If we add the min_length parameter to the translator, as in the how-to-generate article: translator(text, max_length=512, num_beams=5, min_length=512) (for many languages (ja, zh, etc.) we don't know the translated length in tokens, but we don't want to lose text, so we set min_length high).

The output is translated text followed by repetitions:

{'translation_text': "Zuzka lives in a boarding house on the ninth floor, Anička lives three floors upstairs, which floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička live on, what floor does Anička lives on, what floor does Anička lives on, what floor does Anička lives on, what floor does Anička lives on, what floor does Anička lives on, what floor does she lives on, what floor does Anička lives on, what floor does she lives on, what floor does she lives on, what floor, what floor does she lives on, what floor she lives on, what floor, and what floor she lives on the floor, and what floor, and what is she lives on the floor, and what is the floor, and what is the floor of the floor, and what is the floor, and what is the floor, and what is the floor, and what is the floor, and what is the floor, and what is the floor, and what is the floor, and what is the floor, and what does she's on the floor, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, what is, and what is, what is, what is, what is, and what is, what is, and what is, what is, and what is, what is, what is, and what is, what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, what is, and what is, what is, and what is, and what is, what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is, and what is"}

If we try many other parameter combinations, e.g. translator(text, max_length=512, min_length=512, num_beams=5, no_repeat_ngram_size=3, do_sample=True, temperature=1.5, top_p=0.9, early_stopping=True, remove_invalid_values=True)

The translation contains generated text that was never in the source sentences:

[{'translation_text': "Zuzka's living in a penthouse on the ninth floor, Anička's in a three story apartment, which floor does Anička reside on, and what floor is the building on which the building is housed, and how are you supposed to know where she's staying, so what's the floor where the apartment is on the 9th floor... and what is the first floor where Anička is staying... and how is the second floor of the house where the house is, so... what floor does she live on, where's Anička, the third floor, and where is Anička staying in the apartment on the 3rd floor, where you can't find her room, where she can'd say she'd like to go on her own, and you'd wanna know what to do with her room in the next room, so you can I'd tell me that she can be sure that you's not going to be happy with the room to do it, right now, that is, it's all right, you know, right or at least I's right, and I don't think that she't, and that't know that they's what I'll have something that, and we'll want you know that you can be honestly, you'll know that'll be honest that you, right, I mean that I'm sure, you can tell you't you will be right, that that it'll say it't be all right or whatever you know about that you will, you don're not that, but it'd you've got to you know it'm gonna be true, you say that you know right, if they't that's going to me that, I't say, and it' and that, that I will be true or you won'll always, and is, and she'll let me, you will not that'm right, yes or what you' will be that that right, but, and will be, you are gonna be safe to you'l right, or that that'lll be true that we't ever, and yes, but I'l be, right right, they'm going to say, she will be honest or not gonna say that we are, and, that're all right right that he is, you gonna be, but you'"}]

What parameters should be used to get a correct translation of the correct length for many languages with unknown translation lengths? Why does free-form text generation start instead of translation? Is this behavior coming from the transformers pipelines or from the translation models themselves?

Expected behavior

English translation with 3 sentences, i.e.:

Zuzka lives on the 9th floor of a block of flats. Anička lives 3 floors higher. On which floor does Anička live?

gante commented 2 years ago

Hey @Fikavec 👋

Text generation can be very tricky, as you've just explained. The quality of the generated text (i.e. the translation) depends on two things: the model and the generation method.

Regarding the model, my suggestion would be to use a larger model OR a model that contains a single language pair (as opposed to multilingual). You can use the language tags on the Hugging Face Hub 🤗 to help you navigate the sea of models.
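
For illustration, a minimal sketch of the single-pair route. The checkpoint name is an assumption here: any dedicated Czech-to-English model found via the Hub language tags would do.

# Hedged sketch: a dedicated cs->en pair model instead of multilingual NLLB.
# "Helsinki-NLP/opus-mt-cs-en" is assumed to exist on the Hub; substitute any
# Czech->English checkpoint surfaced by the language-tag search.
from transformers import pipeline

pair_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-cs-en")
print(pair_translator("Na kterém podlaží bydlí Anička?", max_length=512))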

Regarding the generation method, you've already mentioned the blog post I usually redirect to in this sort of issue :) If you force min_length, the model tends to hallucinate after it runs out of original content, so I highly advise against using it. However, if you don't, you may get an output that is too short (your first example) -- in that case, you may try playing with the length_penalty parameter (which only has an impact when num_beams>1).
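
As a hedged illustration of the length_penalty route, using the same NLLB checkpoint as the reproduction above (the value 2.0 is just an assumption to experiment with, not a recommendation):

# Sketch: nudge beam search toward longer outputs with length_penalty
# instead of forcing min_length; it only has an effect when num_beams > 1.
from transformers import pipeline

translator = pipeline(
    "translation", model="facebook/nllb-200-distilled-600M",
    src_lang="ces_Latn", tgt_lang="eng_Latn",
)
text = "Zuzka bydlí v paneláku na 9 podlaží. Anička bydlí o 3 podlaží výše. Na kterém podlaží bydlí Anička?"
print(translator(text, max_length=512, num_beams=5, length_penalty=2.0))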

If these two sets of tips do not yield successful results, I still have good news for you -- we are working to implement a new generation strategy which may help in your case (https://github.com/huggingface/transformers/issues/19182) :)

Fikavec commented 2 years ago

Thanks @gante for the explanation and for your work on this great project! I can't figure out whether this issue is a property of the Hugging Face generation implementation or of the original fairseq translation models. Translation is a very specific text generation task where precise output length is critical; if the output length or other generation parameters are necessary for a correct translation, they could be predicted by a special model on top of the tokenizer before generation. #19182 is interesting, but after spending a lot of time searching for parameters manually, I think creating a single formula for 40,000 translation directions would be a miracle.

Maybe the fairseq team could train a model to predict the best generation parameters for 200+ languages on their parallel training data, just as their language identification model was trained. In the future, models that select the best generation parameters might become a standard step after tokenization, or a parameter of the generate function, e.g. generate(input_text, params_predictor=predict_best_params_model), with such predictor models developed and trained separately for different tasks (translation, QA, prompt engineering, etc.) by the authors of generative models and the community, using dedicated test sets and metrics. What do you think about this?
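
A purely hypothetical sketch of that proposal -- neither params_predictor nor predict_best_params_model exists in transformers; this only illustrates the suggested flow of predicting generation parameters per input before calling generate():

# Hypothetical only: no such API exists in transformers today.
def generate_with_predicted_params(model, tokenizer, text, params_predictor):
    inputs = tokenizer(text, return_tensors="pt")
    # A task-specific predictor would map the tokenized input to generation
    # parameters such as max_length, num_beams, length_penalty, ...
    gen_kwargs = params_predictor(inputs)
    return model.generate(**inputs, **gen_kwargs)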

gante commented 2 years ago

if output length or other generation parameters is necessary for correct translation

It is not -- generation ends when the model predicts a special token (eos_token_id) OR when the generation length reaches max_length. This is why you should add a large max_length, so the translation is not constrained by it :)
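
A minimal sketch of that stopping rule with the NLLB checkpoint from the issue (using convert_tokens_to_ids("eng_Latn") for the forced target-language token is an assumption about the tokenizer's special tokens):

# Sketch: with a generous max_length, generation still stops on its own
# when the model emits eos_token_id, so the output is not clipped.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ces_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Zuzka bydlí v paneláku na 9 podlaží.", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),  # assumption: lang code is a known token
    max_length=512,  # generous ceiling; generation ends earlier at eos_token_id
    num_beams=5,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))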

As for your other question: as you wrote, setting the parameters depends on the model itself and on your goals -- there is no silver bullet that fits everyone. However, we have a library that might be of interest to you: evaluate

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.