raymondhs / fairseq-laser

My implementation of LASER architecture in Fairseq

A general question: Why not use the multilingual_translation task directly? #5

Closed · ever4244 closed 4 years ago

ever4244 commented 4 years ago

Hello Raymond:

Long time no see!

I have a general question: why not use the multilingual_translation task directly? For example, with a few modifications to the laser_lstm model, would a setting like this have a similar effect?

--task multilingual_translation --arch laser \
  --lang-pairs de-en,de-es,en-es,es-en,fr-en,fr-es \
  --share-decoders --share-decoder-input-output-embed \

I am asking because I also implemented a Transformer baseline with the setting below:

--task multilingual_translation --arch multilingual_transformer_iwslt_de_en \
  --lang-pairs de-en,de-es,en-es,es-en,fr-en,fr-es \
  --share-decoders --share-decoder-input-output-embed \

What I want to achieve is to compare LASER's LSTM structure with the Transformer's attention-based structure. While doing this, I found that the only major difference between the laser task and the multilingual_translation task is that the former builds its data on "multi_corpus_sampled_dataset" while the latter uses a "round_robin_zip_datasets" dataset, and that multilingual_translation provides a more general-purpose setting.

So would there be a performance difference between the round_robin_zip and multi_corpus_sampled methods in this task? I think the sampling is uniform, so in theory they should be roughly the same?
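
For concreteness, this is roughly how I picture the two options being built from the same per-language-pair datasets (just a sketch; the function is mine and assumes the LanguagePairDataset objects are already loaded, e.g. in the task's load_dataset):

from fairseq.data import MultiCorpusSampledDataset, RoundRobinZipDatasets

def build_training_dataset(lang_pair_datasets, one_pair_per_batch=True):
    # lang_pair_datasets: OrderedDict mapping "de-en", "fr-en", ... to
    # already-loaded LanguagePairDataset objects.
    if one_pair_per_batch:
        # laser-style: every mini-batch comes from a single sampled lang pair
        # (the default sampling function is uniform over the keys).
        return MultiCorpusSampledDataset(lang_pair_datasets)
    # multilingual_translation-style: one batch per lang pair per update.
    return RoundRobinZipDatasets(lang_pair_datasets)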

Why did you implement a dedicated LASER task instead of directly using the default multilingual_translation task provided by fairseq, sharing the encoder and decoder across all languages and using the same dictionary? Am I missing something?

Thank you very much!

raymondhs commented 4 years ago

Hello,

The main difference is in the training loop, which is roughly:

With RoundRobinZipDatasets:

for i in range(len(epoch)):
  for lang_pair in args.lang_pairs:
    batch = next_batch_for_lang_pair(lang_pair)
    loss = criterion(model_for_lang_pair(lang_pair), batch)
    loss.backward()
  optimizer.step()

With MultiCorpusSampledDataset:

for i in range(len(epoch)):
  lang_pair = sample_one_lang_pair(args.lang_pairs)
  batch = next_batch_for_lang_pair(lang_pair)
  loss = criterion(model_for_lang_pair(lang_pair), batch)
  loss.backward()
  optimizer.step()

So effectively, with RoundRobinZipDatasets you get a smaller total number of model updates with a bigger effective batch size (multiple batches, one from each lang pair), vs. a larger number of updates with a smaller batch size (one batch from just one random lang pair) with MultiCorpusSampledDataset.
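
As a back-of-the-envelope illustration (the numbers are made up, just to show the trade-off):

# Processing the same ~6,000 batches of training data:
num_lang_pairs = 6
total_batches = 6000

# RoundRobinZipDatasets: each optimizer step accumulates one batch per pair.
round_robin_updates = total_batches // num_lang_pairs   # 1,000 updates, 6 batches each

# MultiCorpusSampledDataset: each optimizer step uses one batch from one
# randomly sampled pair.
sampled_updates = total_batches                          # 6,000 updates, 1 batch each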

My initial implementation (before January 2020) was closer to fairseq's multilingual_translation, except you also need to include language ID embeddings. I changed to MultiCorpusSampledDataset after reading LASER's blog post, which states:

We trained our system on 223 million sentences of public parallel data, aligned with either English or Spanish. For each mini-batch, we randomly chose an input language and trained the system to translate the sentences into English or Spanish.

I managed to get results closer to LASER's paper this way, but I would think both approaches are viable; it's kind of empirical. :)

ever4244 commented 4 years ago

Thank you for the insight. Is the performance difference a significant one?

The multilingual_translation task is attractive to me because I can easily choose whether to share the encoder or decoder, which makes more flexible model structures viable, and I can run a batch of experiments on different model structures within the same task framework.

Yes, I noticed that the two methods mainly differ in how they form mini-batches and perform updates. So would it be a viable option for me to use your laser_dataset in combination with the multilingual_translation task?

I see three major differences from the standard multilingual_translation setting that I would need to add to my current code: first, the laser_dataset (MultiCorpusSampledDataset); second, the language ID on the decoder input; and finally, the sentence embedding concatenated to the decoder input.
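
For the last two points, this is roughly what I have in mind (a rough PyTorch sketch, not your code; the class name and dimensions are placeholders):

import torch
import torch.nn as nn

class LaserStyleDecoderInput(nn.Module):
    """Concatenate, at every target position, the token embedding with the
    pooled sentence embedding from the encoder and a target-language embedding."""

    def __init__(self, vocab_size, embed_dim, num_langs, lang_dim):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, embed_dim)
        self.embed_langs = nn.Embedding(num_langs, lang_dim)

    def forward(self, prev_output_tokens, sentemb, tgt_lang_ids):
        # prev_output_tokens: (batch, tgt_len) token IDs
        # sentemb:            (batch, sentemb_dim) pooled encoder states
        # tgt_lang_ids:       (batch,) target language IDs
        bsz, tgt_len = prev_output_tokens.size()
        tok = self.embed_tokens(prev_output_tokens)                       # (B, T, embed_dim)
        sent = sentemb.unsqueeze(1).expand(bsz, tgt_len, -1)              # (B, T, sentemb_dim)
        lang = self.embed_langs(tgt_lang_ids).unsqueeze(1).expand(bsz, tgt_len, -1)
        return torch.cat([tok, sent, lang], dim=-1)                       # (B, T, sum of dims)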

Is there anything else that I should be aware of? Regards!

raymondhs commented 4 years ago

I don't recall doing the BUCC experiments using the multilingual_translation task, so I am not too sure about the performance difference. My implementation of LaserDataset is based on MultiCorpusSampledDataset, so it may not be suitable to use directly with fairseq's multilingual_translation task (which needs RoundRobinZipDatasets). I guess you can check multilingual_translation.py in fairseq and find a way to include the language ID information when calling your model.

ever4244 commented 4 years ago

What I did is tie the language ID to the sentemb from the encoder and shut off the real encoder output to the decoder (I used a Transformer structure). So instead of attending over the encoder output, the decoder's attention only attends over the sentemb and the language ID.

I once tied the language ID to x (the decoder-side input), but then I thought it would be faster and cheaper to just concatenate the sentemb with the language embedding.
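
In code, roughly what I mean is something like this (shapes and names are just for illustration):

import torch

def build_decoder_memory(sentemb, lang_emb):
    # sentemb:  (batch, dim)       pooled sentence embedding from the encoder
    # lang_emb: (batch, lang_dim)  learned target-language embedding
    memory = torch.cat([sentemb, lang_emb], dim=-1)  # (batch, dim + lang_dim)
    # Shape it as (src_len=1, batch, features), the layout fairseq uses for
    # encoder_out, so the decoder cross-attends over this single position
    # instead of the full sequence of encoder states.
    return memory.unsqueeze(0)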

Because I am using a new structure (Transformer), I am testing different ways to adapt LASER to it. Currently, I am considering whether I should share the encoder and decoder input embeddings. Have you ever used --share-decoder-input-output-embed?

Thank you for the help! I will share my findings with you, if you are interested, once all my experiments are done.