masakhane-io / masakhane-reading-group

Agile reading group that works

[14/05/2020] 5:15PM GMT+1 Transfer Learning for Low-Resource Neural Machine Translation #2

Closed keleog closed 4 years ago

keleog commented 4 years ago

Link - https://www.isi.edu/natural-language/mt/emnlp16-transfer.pdf

Summary:

The authors perform transfer learning for low-resource NMT via a method that has two stages: first train a model on a high-resource language pair (the parent model), then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain its training. They observe improvements in BLEU score over training on the low-resource pair alone. They also apply this transfer learning method to syntax-based MT and obtain improvements there as well.
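As a minimal sketch of that two-stage idea (not the authors' exact recipe: the models are assumed to be PyTorch `nn.Module` NMT models with the same architecture, and the choice of which parameters to freeze is an assumption), the parent-to-child initialization could look like:

```python
# Illustrative sketch only: copy parent parameters into a child model with the
# same architecture, then freeze a chosen subset ("constrain training").
# Which parameters to freeze is an assumption here, not the paper's exact recipe.

def transfer_parameters(parent_model, child_model, frozen_prefixes=("decoder.embed",)):
    parent_state = parent_model.state_dict()
    child_state = child_model.state_dict()

    for name, tensor in parent_state.items():
        # Copy only shape-compatible parameters; source embeddings usually
        # differ because the parent and child source vocabularies differ.
        if name in child_state and child_state[name].shape == tensor.shape:
            child_state[name] = tensor.clone()

    child_model.load_state_dict(child_state)

    # Keep the transferred parameters we want fixed, e.g. the shared
    # target-side (English) embeddings, out of the child's optimisation.
    for name, param in child_model.named_parameters():
        if name.startswith(frozen_prefixes):
            param.requires_grad = False
    return child_model
```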

jaderabbit commented 4 years ago

My Personal Notes:

Basic Algorithm:

Outcomes:

Missing knowledge:

Missing context:

Relevance/Value:

keleog commented 4 years ago

Some questions:

  1. Has this been done with a Transformer?
  2. Can this be transferred to different sources and targets, where the source and target of the child model are similar to (but different from) the source and target of the parent? Or even a completely different parent and child?
  3. They use 2.5 million tokens as the low-resource setting, which is still relatively high IMO (at least compared to the parallel data sizes of <= 1.5M tokens I have seen so far for African languages).

Found some answers to Q1 and Q2:

Yes - Trivial Transfer Learning for LR NMT (https://arxiv.org/pdf/1809.00357.pdf). The authors drop the restriction that the languages be related and extend the experiments to parent-child pairs where the target language changes. They find that even for unrelated languages there is some transfer improvement, and they conclude that the size of the parent pair's training data matters more than language similarity.

In fact, they swapped the directions of the parent and child pairs and still observed gains, e.g. an XX-EN parent with an EN-YY child. The method also works for sharing the source language, not just the target language. However, embeddings were shared between source and target for both parent and child, which is another possible reason for the transfer gains, and one they did not investigate. They also find some gains when they train source-to-target as the parent and then transfer to target-to-source as the child, e.g. train an English-to-Zulu NMT model and initialise the training of a Zulu-to-English model with it.
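As a rough sketch of this "trivial" recipe (the `build_model`, `build_shared_vocab` and `train_epochs` helpers are hypothetical stand-ins, not code from the paper), the key point is that the child simply continues training from the parent checkpoint, with a shared vocabulary and nothing frozen:

```python
import torch

def trivial_transfer(build_model, build_shared_vocab, train_epochs,
                     parent_corpus, child_corpus,
                     parent_epochs=20, child_epochs=20):
    # One shared subword vocabulary over parent AND child data (all languages),
    # so both pairs use identical embedding matrices and no remapping is needed.
    shared_vocab = build_shared_vocab([parent_corpus, child_corpus])

    # Train the parent (high-resource) pair to convergence, e.g. XX -> EN.
    model = build_model(shared_vocab)
    train_epochs(model, parent_corpus, epochs=parent_epochs)
    torch.save(model.state_dict(), "parent.pt")

    # Continue training the SAME model on the child (low-resource) pair,
    # e.g. EN -> YY or even YY -> EN; nothing is frozen or re-initialised.
    train_epochs(model, child_corpus, epochs=child_epochs)
    return model
```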

jaderabbit commented 4 years ago

@keleog To your #3: the Urdu dataset has only 200k sentences, which is more reflective of other African languages.

keleog commented 4 years ago

@jaderabbit Fair enough. Assuming an average of 15 tokens per sentence, that is around 3 million tokens. But, honestly, I have not seen a parallel set of up to 100k sentences for any African language except the Setswana one reported in one of your papers.
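Making the arithmetic explicit:

```python
sentences = 200_000            # Urdu set size mentioned above
avg_tokens_per_sentence = 15   # rough assumption
print(sentences * avg_tokens_per_sentence)  # 3,000,000 tokens
```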

keleog commented 4 years ago

@jaderabbit Your missing knowledge points are very important and I hope someone has an idea about them. I think the ensemble would be different "Xfer" models with varying parents, children and hyperparameters.

hadyelsahar commented 4 years ago

Saved chat and links from our talks:

18:17:20 From Bernardt Duvenhage : Hi. I'm in a bit of a noisy environment today.
18:17:38 From Bernardt Duvenhage : Will mostly listen in while I try to homeschool my kids :-)
18:22:00 From hady elsahar : +1 for some comments after Kelechi
18:24:57 From hady elsahar : Mikel Artetxe https://scholar.google.com/citations?hl=en&user=N5InzP8AAAAJ&view_op=list_works&sortby=pubdate http://www.mikelartetxe.com/publication/ https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf
18:25:53 From hady elsahar : Shared BPE vocabulary https://www.aclweb.org/anthology/P16-1162/ https://github.com/google/sentencepiece
18:28:57 From kelechukwu : Trivial Transfer Learning for LR NMT (https://arxiv.org/pdf/1809.00357.pdf)
18:29:48 From Jade Abbott : https://iclr.cc/virtual/poster_S1l-C0NtwS.html
18:30:28 From Jade Abbott : https://openreview.net/pdf?id=S1l-C0NtwS
18:31:10 From Bernardt Duvenhage : Read it briefly. Are the child models always the same size as the teacher model?
18:32:33 From kelechukwu : Same size - like embedding and layer dimensions, etc.?
18:34:06 From orevaogheneahia : I have skimmed through the paper. I was wondering how to properly select the appropriate parent languages.
18:34:31 From hady elsahar : bahdanau 2015
18:34:33 From orevaogheneahia : More like how do we measure the similarity?
18:35:20 From Jamiil Toure ALI : Hi all. I read the paper... and I didn't understand the re-correction part. How is that implemented in the paper?
18:35:54 From Bernardt Duvenhage : Thanks. It would be cool to see the papers on the benefit of also incorporating distillation.
18:36:21 From Jamiil Toure ALI : Sorry, re-scoring rather than re-correction.
18:39:13 From hady elsahar : https://github.com/google-research/bert/blob/master/multilingual.md
18:40:08 From kelechukwu : Multilingual Denoising Pre-training for Neural Machine Translation - https://arxiv.org/abs/2001.08210
18:40:32 From kelechukwu : Multilingual BART seems to perform well for LM transfer learning to NMT
18:46:44 From hady elsahar : https://www.aclweb.org/anthology/P19-1301.pdf
18:48:05 From orevaogheneahia : Thanks for sharing.
18:53:27 From hady elsahar : Reranking diverse candidates has been shown to improve results in both open dialog and machine translation (Li et al., 2016a; Li and Jurafsky, 2016; Gimpel et al., 2013)
18:53:32 From hady elsahar : https://www.aclweb.org/anthology/P19-1365.pdf
18:54:12 From Jamiil Toure ALI : Thanks for sharing
18:54:29 From Bernardt Duvenhage : When will next week's paper be announced?
18:55:48 From Bernardt Duvenhage : Very cool idea, yes :-) Thanks
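Since the shared BPE vocabulary link came up above, here is a minimal sketch of training one subword model over the concatenated parent and child corpora (assuming a recent sentencepiece version; the file name and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train one BPE model over the concatenated parent + child training text
# (both source and target sides), so all languages share a single vocabulary.
# "combined.txt" and the vocabulary size are placeholders.
spm.SentencePieceTrainer.train(
    input="combined.txt",
    model_prefix="shared_bpe",
    vocab_size=32000,
    model_type="bpe",
)

# Encode/decode with the shared model.
sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
pieces = sp.encode("this is a test sentence", out_type=str)
print(pieces)
print(sp.decode(pieces))
```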

keleog commented 4 years ago

Thanks @hadyelsahar !