masakhane-io / masakhane-reading-group

Agile reading group that works

[14/05/2020] 5:15PM GMT+1 Transfer Learning for Low-Resource Neural Machine Translation #2

Closed keleog closed 4 years ago

keleog commented 4 years ago

Link - https://www.isi.edu/natural-language/mt/emnlp16-transfer.pdf

Summary:

The authors perform transfer learning for low-resource NMT via a method that has two stages: first train a model on a high-resource language pair (the parent model), then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain its training. They observe improvements in BLEU score over training on the low-resource pair alone. They also apply this transfer learning method to syntax-based MT and obtain improvements there as well.
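As a minimal sketch of that two-stage idea (not the authors' exact recipe: the models are assumed to be PyTorch `nn.Module` NMT models with the same architecture, and the choice of which parameters to freeze is an assumption), the parent-to-child initialization could look like:

```python
# Illustrative sketch only: copy parent parameters into a child model with the
# same architecture, then freeze a chosen subset ("constrain training").
# Which parameters to freeze is an assumption here, not the paper's exact recipe.

def transfer_parameters(parent_model, child_model, frozen_prefixes=("decoder.embed",)):
    parent_state = parent_model.state_dict()
    child_state = child_model.state_dict()

    for name, tensor in parent_state.items():
        # Copy only shape-compatible parameters; source embeddings usually
        # differ because the parent and child source vocabularies differ.
        if name in child_state and child_state[name].shape == tensor.shape:
            child_state[name] = tensor.clone()

    child_model.load_state_dict(child_state)

    # Keep the transferred parameters we want fixed, e.g. the shared
    # target-side (English) embeddings, out of the child's optimisation.
    for name, param in child_model.named_parameters():
        if name.startswith(frozen_prefixes):
            param.requires_grad = False
    return child_model
```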

jaderabbit commented 4 years ago

My Personal Notes:

Basic Algorithm:

Outcomes:

Missing knowledge:

Missing context:

Relevance/Value:

keleog commented 4 years ago

Some questions:

  1. Has this been done with a Transformer?
  2. Can this be transferred to different sources and targets, where the source and target of the child model are similar to (but different from) the source and target of the parent? Or even a completely different parent and child?
  3. They use 2.5 million tokens as the low-resource setting, which is still relatively high IMO (at least compared to the parallel data sizes of <= 1.5M tokens I have seen so far for African languages).

Found some answers to Q1 and Q2:

Yes - Trivial Transfer Learning for LR NMT (https://arxiv.org/pdf/1809.00357.pdf). The authors drop the restriction that the languages be related and extend the experiments to parent-child pairs where the target language changes. They find that even for unrelated languages there is some transfer improvement, and they conclude that the size of the parent pair's training data matters more than language similarity.

In fact, they swapped the directions of the parent and child pairs and still observed gains, e.g. an XX-EN parent with an EN-YY child. The method also works for sharing the source language, not just the target language. However, embeddings were shared between source and target for both parent and child, which is another possible reason for the transfer gains, and one they did not investigate. They also find some gains when they train source-to-target as the parent and then transfer to target-to-source as the child, e.g. train an English-to-Zulu NMT model and initialise the training of a Zulu-to-English model with it.
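As a rough sketch of this "trivial" recipe (the `build_model`, `build_shared_vocab` and `train_epochs` helpers are hypothetical stand-ins, not code from the paper), the key point is that the child simply continues training from the parent checkpoint, with a shared vocabulary and nothing frozen:

```python
import torch

def trivial_transfer(build_model, build_shared_vocab, train_epochs,
                     parent_corpus, child_corpus,
                     parent_epochs=20, child_epochs=20):
    # One shared subword vocabulary over parent AND child data (all languages),
    # so both pairs use identical embedding matrices and no remapping is needed.
    shared_vocab = build_shared_vocab([parent_corpus, child_corpus])

    # Train the parent (high-resource) pair to convergence, e.g. XX -> EN.
    model = build_model(shared_vocab)
    train_epochs(model, parent_corpus, epochs=parent_epochs)
    torch.save(model.state_dict(), "parent.pt")

    # Continue training the SAME model on the child (low-resource) pair,
    # e.g. EN -> YY or even YY -> EN; nothing is frozen or re-initialised.
    train_epochs(model, child_corpus, epochs=child_epochs)
    return model
```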

jaderabbit commented 4 years ago

@keleog To your #3: the Urdu dataset has only 200k sentences, which is more reflective of other African languages.

keleog commented 4 years ago

@jaderabbit Fair enough. Assuming an average of 15 tokens per sentence, that is around 3 million tokens. But, honestly, I have not seen a parallel set of up to 100k sentences for any African language except the Setswana one reported in one of your papers.
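Making the arithmetic explicit:

```python
sentences = 200_000            # Urdu set size mentioned above
avg_tokens_per_sentence = 15   # rough assumption
print(sentences * avg_tokens_per_sentence)  # 3,000,000 tokens
```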

keleog commented 4 years ago

@jaderabbit Your missing knowledge points are very important and I hope someone has an idea about them. I think the ensemble would be different "Xfer" models with varying parents, children and hyperparameters.

hadyelsahar commented 4 years ago

Saved chat and links from our talks:

18:17:20 From Bernardt Duvenhage : Hi. I'm in a bit of a noisy environment today.
18:17:38 From Bernardt Duvenhage : Will mostly listen in while I try to homeschool my kids :-)
18:22:00 From hady elsahar : +1 for some comments after Kelechi
18:24:57 From hady elsahar : Mikel Artetxe https://scholar.google.com/citations?hl=en&user=N5InzP8AAAAJ&view_op=list_works&sortby=pubdate http://www.mikelartetxe.com/publication/ https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf
18:25:53 From hady elsahar : Shared BPE vocabulary https://www.aclweb.org/anthology/P16-1162/ https://github.com/google/sentencepiece
18:28:57 From kelechukwu : Trivial Transfer Learning for LR NMT (https://arxiv.org/pdf/1809.00357.pdf)
18:29:48 From Jade Abbott : https://iclr.cc/virtual/poster_S1l-C0NtwS.html
18:30:28 From Jade Abbott : https://openreview.net/pdf?id=S1l-C0NtwS
18:31:10 From Bernardt Duvenhage : Read it briefly. Are the child models always the same size as the teacher model?
18:32:33 From kelechukwu : Same size - like embedding and layer dimensions, etc.?
18:34:06 From orevaogheneahia : I have skimmed through the paper. I was wondering how to properly select the appropriate parent languages.
18:34:31 From hady elsahar : bahdanau 2015
18:34:33 From orevaogheneahia : More like how do we measure the similarity?
18:35:20 From Jamiil Toure ALI : Hi all. I read the paper... and I didn't understand the re-correction part. How is that implemented in the paper?
18:35:54 From Bernardt Duvenhage : Thanks. It would be cool to see the papers on the benefit of also incorporating distillation.
18:36:21 From Jamiil Toure ALI : Sorry, re-scoring rather than re-correction.
18:39:13 From hady elsahar : https://github.com/google-research/bert/blob/master/multilingual.md
18:40:08 From kelechukwu : Multilingual Denoising Pre-training for Neural Machine Translation - https://arxiv.org/abs/2001.08210
18:40:32 From kelechukwu : Multilingual BART seems to perform well for LM transfer learning to NMT
18:46:44 From hady elsahar : https://www.aclweb.org/anthology/P19-1301.pdf
18:48:05 From orevaogheneahia : Thanks for sharing.
18:53:27 From hady elsahar : Reranking diverse candidates has been shown to improve results in both open dialog and machine translation (Li et al., 2016a; Li and Jurafsky, 2016; Gimpel et al., 2013)
18:53:32 From hady elsahar : https://www.aclweb.org/anthology/P19-1365.pdf
18:54:12 From Jamiil Toure ALI : Thanks for sharing
18:54:29 From Bernardt Duvenhage : When will next week's paper be announced?
18:55:48 From Bernardt Duvenhage : Very cool idea, yes :-) Thanks
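Since the shared BPE vocabulary link came up above, here is a minimal sketch of training one subword model over the concatenated parent and child corpora (assuming a recent sentencepiece version; the file name and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train one BPE model over the concatenated parent + child training text
# (both source and target sides), so all languages share a single vocabulary.
# "combined.txt" and the vocabulary size are placeholders.
spm.SentencePieceTrainer.train(
    input="combined.txt",
    model_prefix="shared_bpe",
    vocab_size=32000,
    model_type="bpe",
)

# Encode/decode with the shared model.
sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
pieces = sp.encode("this is a test sentence", out_type=str)
print(pieces)
print(sp.decode(pieces))
```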

keleog commented 4 years ago

Thanks @hadyelsahar !