[Closed] ever4244 closed this issue 4 years ago.
Hello,
The main difference is in the training loop, which is roughly:
With RoundRobinZipDatasets:

```python
for i in range(len(epoch)):
    # one batch from every lang pair, then a single update
    for lang_pair in args.lang_pairs:
        batch = next_batch_for_lang_pair(lang_pair)
        loss = criterion(model_for_lang_pair(lang_pair), batch)
        loss.backward()
    optimizer.step()
```
With MultiCorpusSampledDataset:

```python
for i in range(len(epoch)):
    # one batch from a single randomly sampled lang pair per update
    lang_pair = sample_one_lang_pair(args.lang_pairs)
    batch = next_batch_for_lang_pair(lang_pair)
    loss = criterion(model_for_lang_pair(lang_pair), batch)
    loss.backward()
    optimizer.step()
```
So effectively, you get a smaller total number of model updates with a bigger effective batch size (one batch from every lang pair per update) with RoundRobinZipDatasets, vs. a bigger number of updates with a smaller batch size (one batch from a single random lang pair per update) with MultiCorpusSampledDataset.
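As a toy numeric illustration of that trade-off (the numbers here are hypothetical, not from this thread):

```python
# Hypothetical numbers to illustrate the trade-off described above.
lang_pairs = ["de-en", "de-es", "en-es", "fr-en"]
steps_per_epoch = 100

# RoundRobinZipDatasets-style: each update aggregates one batch per lang pair.
rr_updates = steps_per_epoch                      # 100 updates
rr_batches_per_update = len(lang_pairs)           # 4 batches per update

# MultiCorpusSampledDataset-style: each update uses one batch from one pair,
# so covering the same amount of data takes len(lang_pairs) times more updates.
mcs_updates = steps_per_epoch * len(lang_pairs)   # 400 updates
mcs_batches_per_update = 1

print(rr_updates, rr_batches_per_update, mcs_updates, mcs_batches_per_update)
# prints: 100 4 400 1
```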
My initial implementation (before January 2020) was closer to fairseq's multilingual_translation, except that you also need to include language-ID embeddings. I switched to MultiCorpusSampledDataset after reading LASER's blog post, which stated:
> We trained our system on 223 million sentences of public parallel data, aligned with either English or Spanish. For each mini-batch, we randomly chose an input language and trained the system to translate the sentences into English or Spanish.
I managed to get results closer to LASER's paper this way, but I would think both approaches are viable; it's kind of empirical. :)
Thank you for the insight! Is the performance difference a significant one?
The multilingual_translation task is attractive to me because I can easily choose whether to use a shared encoder or decoder, which makes more flexible model structures viable, and I can run a batch of experiments on different model structures within the same task framework.
Yes, I noticed that the two methods mainly differ in how they build mini-batches and apply updates. So would it be a viable option for me to use your laser_dataset in combination with the multilingual_translation task?
I notice three major differences compared with the standard multilingual_translation setting that I need to add to my current code: first, the laser_dataset (MultiCorpusSampledDataset); second, the language ID on the decoder input; and finally, a sentence-embedding concatenation to the decoder input.
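The last two of those additions could be sketched roughly as follows, assuming the language-ID embedding and the encoder's fixed-size sentence embedding are broadcast across target positions and concatenated onto the token embeddings (all names here are illustrative, not fairseq API):

```python
import torch
import torch.nn as nn

class DecoderInputBuilder(nn.Module):
    """Illustrative sketch: token embedding + language-ID embedding +
    sentence embedding, concatenated at every decoder position."""

    def __init__(self, vocab_size, embed_dim, num_langs, sentemb_dim):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, embed_dim)
        self.embed_langs = nn.Embedding(num_langs, embed_dim)

    def forward(self, prev_output_tokens, lang_id, sentemb):
        # prev_output_tokens: (bsz, tgt_len); lang_id: (bsz,); sentemb: (bsz, sentemb_dim)
        bsz, tgt_len = prev_output_tokens.size()
        tok = self.embed_tokens(prev_output_tokens)                     # (B, T, E)
        lang = self.embed_langs(lang_id).unsqueeze(1).expand(-1, tgt_len, -1)
        sent = sentemb.unsqueeze(1).expand(-1, tgt_len, -1)
        return torch.cat([tok, lang, sent], dim=-1)                     # (B, T, 2E + S)

builder = DecoderInputBuilder(vocab_size=1000, embed_dim=64, num_langs=4, sentemb_dim=512)
out = builder(torch.zeros(2, 5, dtype=torch.long), torch.tensor([1, 3]), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 5, 640])
```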
Is there anything else that I should be aware of? Regards!
I don't recall doing the BUCC experiments using the multilingual_translation task so I am not too sure about the performance difference. My implementation for LaserDataset is based on MultiCorpusSampledDataset, so it may not be suitable to use directly with Fairseq's multilingual_task (it needs RoundRobinZipDatasets). I guess you can check multilingual_translation.py in Fairseq, and find some ways to include the language ID information when calling your model.
What I did is tie the language ID to the sentemb from the encoder, and disable the real encoder output to the decoder (I used a Transformer structure). So instead of attending over the encoder output, the decoder attends only over the sentemb and language ID.
I once tied the language ID to x (the decoder-side input), but then I thought it would be faster and cheaper to just concatenate the sentemb with the language embedding.
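One possible reading of the two comments above, as a minimal sketch: concatenate the sentence embedding and language embedding along the feature dimension, and hand the result to the decoder as a single-position "encoder output" for cross-attention (the function name and dimensions are illustrative):

```python
import torch

def build_attention_memory(sentemb, lang_emb):
    # Replace the full encoder output with a single position made of
    # [sentemb ; lang_emb], so cross-attention sees only this vector.
    return torch.cat([sentemb, lang_emb], dim=-1).unsqueeze(1)  # (B, 1, D_s + D_l)

memory = build_attention_memory(torch.randn(8, 512), torch.randn(8, 32))
print(memory.shape)  # torch.Size([8, 1, 544])
```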
Because I am using a new structure (Transformer), I am testing different ways to adapt LASER to it. Currently, I am considering whether I should share the encoder and decoder input embeddings. Have you ever used --share-decoder-input-output-embed?
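For reference, the effect of --share-decoder-input-output-embed in fairseq is weight tying between the decoder's input embedding matrix and its output projection; in miniature:

```python
import torch.nn as nn

# Miniature of decoder input/output embedding tying: the Linear projection
# and the Embedding share one parameter tensor, so any gradient update to
# one updates the other as well.
vocab_size, embed_dim = 1000, 64
embed_tokens = nn.Embedding(vocab_size, embed_dim)
output_proj = nn.Linear(embed_dim, vocab_size, bias=False)
output_proj.weight = embed_tokens.weight  # one shared parameter tensor

print(output_proj.weight is embed_tokens.weight)  # True
```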
Thank you for the help! I will share my findings with you, if you are interested, once all my experiments are done.
Hello Raymond:
Long time no see!
I have a general question: why not use the multilingual_translation task directly? For example, would a setting like the one below have a similar effect, with a few modifications to the laser_lst model?

```
--task multilingual_translation --arch laser \
--lang-pairs de-en,de-es,en-es,es-en,fr-en,fr-es \
--share-decoders --share-decoder-input-output-embed \
```
I am asking because I also implemented a Transformer-structure baseline with the setting below:

```
--task multilingual_translation --arch multilingual_transformer_iwslt_de_en \
--lang-pairs de-en,de-es,en-es,es-en,fr-en,fr-es \
--share-decoders --share-decoder-input-output-embed \
```
What I want to achieve is to compare LASER's LSTM structure with a Transformer attention-based structure. While doing this, I found that the only major difference between the laser task and the multilingual_translation task is that the former uses a "multi_corpus_sampled_dataset"-based laser dataset while the latter uses a "round_robin_zip_datasets" dataset. And multilingual_translation provides a more general-purpose setting.
So would there be a performance difference between the round_robin_zip and multi_corpus_sampled methods in this task? I think the sampling is uniform; therefore, in theory, they should be roughly the same?
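As a quick sanity check of the uniform-sampling assumption: if the sampler picks a corpus uniformly at random per update (as MultiCorpusSampledDataset does by default, to my understanding), each lang pair should receive roughly 1/N of the updates in expectation:

```python
import random
from collections import Counter

# Simulate uniform per-update corpus sampling over the six lang pairs
# from the command above; each pair should get roughly 1/6 of the updates.
random.seed(0)
lang_pairs = ["de-en", "de-es", "en-es", "es-en", "fr-en", "fr-es"]
num_updates = 60000
counts = Counter(random.choice(lang_pairs) for _ in range(num_updates))
for pair in lang_pairs:
    print(pair, round(counts[pair] / num_updates, 3))  # each close to 1/6 ~= 0.167
```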
Why did you implement a LASER task instead of directly using the default multilingual_translation task provided by fairseq, setting the encoder and decoder to be shared by all languages, and using the same dictionary? Am I missing something?
Thank you very much!