Closed: ChuanyangZheng closed this issue 3 years ago.
Thank you for asking! As we mentioned in the paper, we omit the word embedding lookup table from the model parameters. : )
Thank you very much for your kind reply. However, you might have missed my point. I wonder how you compress the original Transformer into the different model sizes in Table 1. For example, the smallest 2.8M Transformer is much smaller than the original Transformer's 45M (not counting the word embeddings).
Thank you for asking! As we mentioned in the paper, we shrink the embedding size of the model to reduce the number of parameters, following the settings in the Evolved Transformer.
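To see how shrinking the embedding size (d_model) alone accounts for the smaller models, here is a rough back-of-the-envelope sketch. It assumes the standard encoder-decoder layout from Vaswani et al. (2017) with 6+6 layers and the usual d_ff = 4 × d_model ratio, and it ignores biases and LayerNorm parameters, so the numbers are approximate, not the paper's exact configuration.

```python
def approx_params(d_model, d_ff=None, enc_layers=6, dec_layers=6):
    """Approximate encoder-decoder Transformer parameter count,
    excluding the word embedding lookup table.
    Assumes d_ff = 4 * d_model unless given; biases/LayerNorm omitted."""
    if d_ff is None:
        d_ff = 4 * d_model
    # Encoder layer: self-attention (4 * d^2 for Q/K/V/output projections)
    # plus a two-matrix feed-forward block (2 * d * d_ff).
    enc_layer = 4 * d_model**2 + 2 * d_model * d_ff
    # Decoder layer: self-attention + cross-attention (8 * d^2) plus FFN.
    dec_layer = 8 * d_model**2 + 2 * d_model * d_ff
    return enc_layers * enc_layer + dec_layers * dec_layer

for d in (128, 256, 512):
    print(d, f"{approx_params(d) / 1e6:.1f}M")
# d_model = 128 gives ~2.8M and d_model = 512 gives ~44M,
# roughly matching the smallest and largest sizes discussed above.
```

So simply reducing d_model from 512 to 128 already takes a base-style model from roughly 44M down to roughly 2.8M non-embedding parameters, with no pruning involved.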
Hello, I am confused by your results on WMT'14 En-De and WMT'14 En-Fr. I wonder how you get the Transformer proposed by Vaswani et al. (2017) for WMT at different parameter counts such as 2.8M and 5.7M — by pruning, I guess?