Hey @Fikavec 👋
Text generation can be very tricky, as you've just explained. The quality of the generated text (i.e. the translation) depends on two things: the model and the generation method.
Regarding the model, my suggestion would be to use a larger model OR a model that contains a single language pair (as opposed to multilingual). You can use the language tags on the Hugging Face Hub 🤗 to help you navigate the sea of models.
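For illustration (the checkpoint and language pair below are just an example, not the model from this issue), a dedicated single-pair model from the Hub can be loaded directly:

```python
# Sketch: a dedicated single language-pair checkpoint from the Hub
# (example pair; pick the checkpoint that matches your source/target languages).
from transformers import pipeline

ru_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ru-en")
print(ru_en("Пример текста для перевода.", max_length=512)[0]["translation_text"])
```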
Regarding the generation method, you've already mentioned the blog post I usually redirect to in this sort of issue :) If you force min_length, the model tends to hallucinate after it runs out of original content, so I highly advise against using it. However, if you don't set it, you may get an output that is too short (your first example) -- in that case, you may try playing with the length_penalty parameter (which only has an impact when num_beams > 1).
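In code, that suggestion might look roughly like the following (the checkpoint and source text are placeholders, not the exact setup from this issue):

```python
# Sketch: beam search with length_penalty instead of forcing min_length.
# length_penalty > 1.0 nudges beam search toward longer outputs.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ru-en")  # example checkpoint
text = "Первое предложение. Второе предложение. Третье предложение."  # placeholder source

outputs = translator(text, max_length=512, num_beams=5, length_penalty=1.2)
print(outputs[0]["translation_text"])
```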
If these two sets of tips do not yield successful results, I still have good news for you -- we are working to implement a new generation strategy which may help in your case (https://github.com/huggingface/transformers/issues/19182) :)
Thanks @gante for the explanation and for your work on this great project! I can't figure out whether this issue is a feature of the Hugging Face generation implementation or of the original fairseq translation models. Translation is a very specific text-generation task where precise output length is critical -- if the output length or other generation parameters are necessary for a correct translation, they could be predicted by a special model on top of the tokenizer before the translation is generated. #19182 is interesting, but after spending a lot of time searching for parameters manually, I think that creating a single formula for 40,000 translation directions would be a miracle. Maybe the fairseq team could train a model to predict the best generation parameters for 200+ languages on their parallel training data, just as the language-identification model was trained. In the future of generation development, models for selecting the best generation parameters could become a standard step after tokenization, or a parameter of the generate function, e.g. generate(input_text, params_predictor=predict_best_params_model), with predict_best_params_models developed and trained separately for different tasks (translation, QA, prompt engineering, etc.) by the authors of generative models and the community, using special test sets and metrics. What do you think about this?
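A very rough sketch of this idea (predict_best_params and the predictor interface are hypothetical names, not an existing transformers API) might look like:

```python
# Hypothetical illustration of the proposal above: a predictor step between
# tokenization and generation. predict_best_params does NOT exist in
# transformers; here it is only a toy stand-in heuristic.
def predict_best_params(input_ids):
    # A trained model could map input features to generation settings;
    # this toy version just scales max_length with the input length.
    return {"max_length": int(input_ids.shape[-1] * 2), "num_beams": 5}

def translate_with_predictor(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    params = predict_best_params(inputs["input_ids"])
    output_ids = model.generate(**inputs, **params)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```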
if the output length or other generation parameters are necessary for a correct translation
It is not -- generation ends when the model predicts a special token (eos_token_id) OR when the generation length reaches max_length. This is why you should add a large max_length, so the translation is not constrained by it :)
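As an illustration of that point (checkpoint and input are placeholders), a generous max_length only acts as a cap; the model still stops on its own at the EOS token:

```python
# Sketch: a large max_length is only an upper bound; generation stops
# earlier as soon as the model emits eos_token_id.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "Helsinki-NLP/opus-mt-ru-en"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("Пример текста.", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=512)  # ends at EOS well before 512
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```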
As for your other question, as you wrote, setting the parameters depends on the model itself and your goals -- there is no silver bullet that would fit everyone. However, we have a library that might be of interest to you: evaluate
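For example (metric choice and strings are only illustrative), evaluate can score candidates produced with different generation settings against a reference:

```python
# Sketch: scoring candidate translations with a metric from `evaluate`
# to compare different generation settings.
import evaluate

sacrebleu = evaluate.load("sacrebleu")
reference = ["The quick brown fox jumps over the lazy dog."]

candidates = [
    "The quick brown fox jumps over a lazy dog.",
    "A fast fox jumps.",
]
for candidate in candidates:
    result = sacrebleu.compute(predictions=[candidate], references=[reference])
    print(f"{result['score']:.1f}  {candidate}")
```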
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.22.2
Who can help?
Models:
Example - [QUESTION] model translates only a part of the text
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
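The snippet behind this report is not reproduced above; a minimal sketch of the kind of call being described (the checkpoint, language codes, and source text are assumptions for illustration) could be:

```python
# Sketch of the setup being described; the checkpoint, language codes and
# source text are assumptions for illustration, not the original snippet.
from transformers import pipeline

translator = pipeline("translation",
                      model="facebook/nllb-200-distilled-600M",
                      src_lang="rus_Cyrl", tgt_lang="eng_Latn")

text = "Первое предложение. Второе предложение. Третье предложение."
print(translator(text, max_length=512))
```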
This outputs only one sentence (2 sentences are lost):
If we add the min_length parameter to the translator call, as in the how-to-generate article:
translator(text, max_length=512, num_beams=5, min_length=512)
(for many languages (ja, zh, etc.) we don't know the translated length in tokens, but we don't want to lose text, so we set min_length higher), it outputs translated text with repetitions:
If we add many other parameter combinations:
translator(text, max_length=512, min_length=512, num_beams=5, no_repeat_ngram_size=3, do_sample=True, temperature=1.5, top_p=0.9, early_stopping=True, remove_invalid_values=True)
The translation will contain generated text that was not in the original sentence:
What parameters should be used to get a correct translation of the correct length for many languages with unknown translation lengths? Why does free text generation start instead of translation? Is this the behavior of the transformers pipelines or of the translation models?
Expected behavior
English translation with 3 sentences: