Open ChenYutongTHU opened 3 years ago
Hi, thanks for this interesting work :)

I notice in your paper that you share encoder/decoder parameters for the two languages and use a single transformer to translate bidirectionally, with language-type embeddings identifying the source/target language. I wonder whether this can hurt performance due to capacity dilution. When a single encoder-decoder processes both directions at once, is it better to double the transformer's number of layers from 6 to 12?

I'm also interested in the baseline implementation. In the COPY and BACK baselines, are the two translation directions handled by two separate transformers or by a single one? It is impressive that your proposed approach beats DALI and DAFE, where each 6-layer encoder/decoder is dedicated to a single language.

Thanks~

Hi, I am sorry for the late response; I have been busy with a conference deadline. Here are the answers:

Thanks for your reply! Your answers really got me thinking. It makes sense that sharing the network might not hurt performance, since your experiments are conducted on EN-DE and EN-RO, which are linguistically similar. Thanks :)
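For readers skimming the thread, the language-tagging scheme being discussed can be sketched in a few lines. This is a minimal numpy illustration of the simplest additive variant (summing a per-language vector into the token embeddings so a single shared encoder/decoder knows which direction it is translating); all names and sizes here are illustrative, not taken from the paper's code.

```python
import numpy as np

# Toy sizes for illustration only.
VOCAB, D_MODEL, N_LANGS = 100, 16, 2

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(VOCAB, D_MODEL))   # shared token embedding table
lang_emb = rng.normal(size=(N_LANGS, D_MODEL))  # one vector per language, e.g. EN=0, DE=1

def embed(token_ids, lang_id):
    """Add the language-type embedding to every token embedding, so the
    same shared transformer can distinguish the two translation directions."""
    return token_emb[token_ids] + lang_emb[lang_id]

# The same token ids get different representations depending on the tag:
src = embed([3, 7, 9], lang_id=0)  # sequence treated as the EN side
tgt = embed([3, 7, 9], lang_id=1)  # same ids treated as the DE side
```

Under this scheme the per-direction capacity cost is tiny (one extra d_model-sized vector per language), which is part of why sharing a single 6-layer model across both directions can remain competitive.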