Open ChenYutongTHU opened 3 years ago
Hi, thanks for this interesting work :)

I notice in your paper that you share encoder/decoder parameters for the two languages and use a single transformer to translate bidirectionally, with language-type embeddings identifying the source/target language. I wonder whether this can hurt performance due to capacity dilution. When a single encoder-decoder processes both directions at once, is it better to double the transformer's number of layers from 6 to 12?

I'm also interested in the baseline implementation. In the COPY and BACK baselines, are the two translation directions handled by two separate transformers or by a single one? It is impressive that your proposed approach beats DALI and DAFE, where each 6-layer encoder/decoder is dedicated to a single language.

Thanks~

Hi, I am sorry for the late response; I have been busy with a conference deadline. Here are the answers:

Thanks for your reply! Your answers really got me thinking. It makes sense that sharing the network might not hurt performance, since your experiments are conducted on EN-DE and EN-RO, which are linguistically similar. Thanks :)
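For readers skimming the thread, the language-tagging scheme being discussed can be sketched in a few lines. This is a minimal numpy illustration of the simplest additive variant (summing a per-language vector into the token embeddings so a single shared encoder/decoder knows which direction it is translating); all names and sizes here are illustrative, not taken from the paper's code.

```python
import numpy as np

# Toy sizes for illustration only.
VOCAB, D_MODEL, N_LANGS = 100, 16, 2

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(VOCAB, D_MODEL))   # shared token embedding table
lang_emb = rng.normal(size=(N_LANGS, D_MODEL))  # one vector per language, e.g. EN=0, DE=1

def embed(token_ids, lang_id):
    """Add the language-type embedding to every token embedding, so the
    same shared transformer can distinguish the two translation directions."""
    return token_emb[token_ids] + lang_emb[lang_id]

# The same token ids get different representations depending on the tag:
src = embed([3, 7, 9], lang_id=0)  # sequence treated as the EN side
tgt = embed([3, 7, 9], lang_id=1)  # same ids treated as the DE side
```

Under this scheme the per-direction capacity cost is tiny (one extra d_model-sized vector per language), which is part of why sharing a single 6-layer model across both directions can remain competitive.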