huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

how to use EncoderDecoderModel to do en-de translation? #8944

Open CharizardAcademy opened 3 years ago

CharizardAcademy commented 3 years ago

I have trained an EncoderDecoderModel from Hugging Face to do an English-German translation task. I tried to overfit a small dataset (100 parallel sentences), then used model.generate() and tokenizer.decode() to perform the translation. However, while the output looks like proper German sentences, it is definitely not the correct translation.

Here is the code for building the model:

from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

encoder_config = BertConfig()
decoder_config = BertConfig()
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = EncoderDecoderModel(config=config)

Here is the code for testing the model:

import torch

model.eval()
# tokenizer is the BERT tokenizer used during training
input_ids = torch.tensor(tokenizer.encode(input_text)).unsqueeze(0)
output_ids = model.generate(input_ids.to('cuda'), decoder_start_token_id=model.config.decoder.pad_token_id)
output_text = tokenizer.decode(output_ids[0])

Example input: "iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould ."

Ground truth translation: "iron cement ist eine gebrauchs ##AT##-##AT## fertige Paste , die mit einem Spachtel oder den Fingern als Hohlkehle in die Formecken ( Winkel ) der Stahlguss -Kokille aufgetragen wird ."

What the model outputs after being trained for 100 epochs: "[S] wenn sie den unten stehenden link anklicken, sehen sie ein video uber die erstellung ansprechender illustrationen in quarkxpress", which is complete nonsense.

Where is the problem?

LysandreJik commented 3 years ago

Hello, thanks for opening an issue! We try to keep the GitHub issues for bugs/feature requests. Could you ask your question on the forum instead?

Thanks!

cc @patrickvonplaten who might have an idea.

patrickvonplaten commented 3 years ago

This blog post should also help with fine-tuning a warm-started encoder-decoder model: https://huggingface.co/blog/warm-starting-encoder-decoder . But as @LysandreJik said, the forum is the better place to ask.
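For reference, here is a minimal warm-starting sketch along the lines of that blog post (the multilingual checkpoint and the special-token settings below are only illustrative choices, not a tested recipe):

from transformers import BertTokenizer, EncoderDecoderModel

# Warm-start encoder and decoder from a pretrained checkpoint; a multilingual
# one is chosen here so the decoder side has also seen German during pre-training.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# For a BERT-like decoder, the generation special tokens must be set explicitly.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id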

zmf0507 commented 3 years ago

@patrickvonplaten the blog post mentions a notebook for the machine translation task, but clicking the link just redirects back to the blog. I think there might be a mistake in the notebook link. Could you please share the translation notebook for the WMT dataset?

patrickvonplaten commented 3 years ago

Hey @zmf0507 - sadly I haven't found the time yet to put this notebook together.

zmf0507 commented 3 years ago

@patrickvonplaten please let me know here when you make one. Despite being so popular, Hugging Face doesn't provide any tutorial/notebook for machine translation. I think a lot of people might be looking for similar resources. It would help a lot. Thanks

patrickvonplaten commented 3 years ago

We now have one for mBART: https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb -> will try to make one for Encoder Decoder as well when I find time :-)
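For a quick taste of mBART translation inference (the en-ro checkpoint below is just a public example; the notebook covers the actual fine-tuning):

from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

inputs = tokenizer("UN Chief Says There Is No Military Solution in Syria", return_tensors="pt")
# mBART expects the target language code as the first decoder token.
generated = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))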

zmf0507 commented 3 years ago

sure. thanks a lot :)

zmf0507 commented 3 years ago

@patrickvonplaten is there any encoder-decoder notebook for the translation task yet? Thanks

patrickvonplaten commented 3 years ago

I'm sadly not finding the time to do so at the moment :-/

I'll put this up as a "Good First Issue" now in case someone from the community finds time to make such a notebook.

A notebook for EncoderDecoderModel translation should look very similar to this notebook: https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Leveraging_Pre_trained_Checkpoints_for_Encoder_Decoder_Models.ipynb - one only has to replace the summarization dataset with a translation dataset.
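A rough sketch of that swap, using the wmt16 "de-en" config from the datasets library as one possible choice (tokenizer is assumed to be the model's tokenizer, set up as in the notebook):

from datasets import load_dataset

# Load a German-English parallel corpus in place of the summarization data
# (only a small slice here, purely as an illustration).
raw = load_dataset("wmt16", "de-en", split="train[:1%]")

def to_features(batch):
    # Each wmt16 row looks like {"translation": {"de": "...", "en": "..."}}.
    src = [ex["en"] for ex in batch["translation"]]
    tgt = [ex["de"] for ex in batch["translation"]]
    model_inputs = tokenizer(src, padding="max_length", truncation=True, max_length=128)
    labels = tokenizer(tgt, padding="max_length", truncation=True, max_length=128)
    # As in the notebook, pad token ids in the labels are replaced by -100
    # so that they are ignored by the loss.
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs

train_data = raw.map(to_features, batched=True, remove_columns=raw.column_names)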

zmf0507 commented 3 years ago

@patrickvonplaten thanks for the update. Can you tell me if there is any work on keyphrase/keyword generation (a seq2seq task) using Hugging Face? I am looking for tutorials and examples where I can try out and play around with keyphrase generation. This task is not mentioned on the Hugging Face notebooks page either. Please let me know.

patrickvonplaten commented 3 years ago

My best advice would be to ask this question on the forum - sadly, I don't know of any work related to this.

parambharat commented 3 years ago

@patrickvonplaten: Here's my attempt at modifying the condensed version of BERT2BERT.ipynb to use the WMT dataset and a BLEU-4 score for the en-de translation task.
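For the scoring part, a minimal sketch of how the BLEU computation could look with sacrebleu via the evaluate library (the prediction/reference strings are placeholders):

import evaluate

bleu = evaluate.load("sacrebleu")
predictions = ["Das ist ein Test ."]          # model outputs, one string per example
references = [["Dies ist ein Test ."]]        # one list of reference strings per example
print(bleu.compute(predictions=predictions, references=references)["score"])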

Nid989 commented 2 years ago

> We have now one for mBart: https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb -> will try to make one for Encoder-Decoder as well when I find time :-)

Inferring the model training details from BERT2BERT for CNN/DailyMail is not sufficient. We experimented with an MT model on the MuST-C data for en-fr; however, the predictions were almost random, and the model was not able to capture the core meaning of its input sequence.

Nid989 commented 2 years ago

If anyone has a complete notebook based on the Encoder-Decoder model for MT, please share. Thank you.

xueqianyi commented 2 years ago

Has anyone performed the translation task correctly using bert2bert? TAT

patrickvonplaten commented 2 years ago

@xueqianyi - maybe you have more luck on https://discuss.huggingface.co/ ?

ydshieh commented 2 years ago

Just an extra comment here: bert2bert is not very helpful for MT, as BERT is pre-trained only on English data.
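One possible way around that (only a sketch; the checkpoint names are illustrative): warm-start the decoder from a German pre-trained checkpoint so the target side has actually seen German, and tokenize source and target with their respective tokenizers.

from transformers import BertTokenizer, EncoderDecoderModel

# English encoder + German decoder; the cross-attention weights are newly
# initialized and still have to be learned from parallel data.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-cased", "dbmdz/bert-base-german-cased"
)

src_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tgt_tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

# Generation special tokens come from the decoder-side tokenizer.
model.config.decoder_start_token_id = tgt_tokenizer.cls_token_id
model.config.eos_token_id = tgt_tokenizer.sep_token_id
model.config.pad_token_id = tgt_tokenizer.pad_token_id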

desaibhargav commented 2 years ago

Hi there, I'm a Data Science grad student at Luddy. I was looking to contribute to open source in my free time and came across this issue. I put a rough notebook together, linking it here @xueqianyi @CharizardAcademy. I would love to polish it to the standard upheld in the HF community if it's indeed helpful.

Just some comments (I did NOT spend a lot of time on this, so your observations MIGHT differ):

1) The translation quality depends a lot on model capacity, though even using base BERT, the translations are fairly decent and definitely not gibberish. Tweaking the decoding parameters will help too.

2) I've trained on only 1M examples due to compute constraints, but I believe a few multiples more might work out better. I also trained with 0.1M and 0.5M examples and saw consistent improvements in the BLEU score with every increase.

3) The length of the tensors fed into the model (post-tokenization) has an impact on the translation quality too. Specifically, max_length=64 and higher results in a lot of repetitions, especially for short sentences, because most examples (95%) in this particular dataset (the 1M subset) are below 32 tokens. Hence I recommend spending some time tweaking the decoding parameters, no_repeat_ngram_size, max_length, length_penalty, etc. in particular; see the sketch after this list.

4) Also, the model seems to think President Obama and President Bush are the same person, EVERY TIME. xD
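Here's roughly what the decoding-parameter tweaks from point 3 could look like (the values are illustrative, not tuned; model, tokenizer, and input_ids are the ones from the notebook):

output_ids = model.generate(
    input_ids,
    max_length=64,
    num_beams=4,
    no_repeat_ngram_size=3,
    length_penalty=0.8,
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))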

mahita2104 commented 1 year ago

I would like to work on this issue