Open alexm-gc opened 2 years ago
Hi @alexm-gc
Thank you for your questions.
For Question 0: as opposed to the architectures presented in Table 2, the Reformer is specific to Transformers. In addition, note that the Reformer computes Y_1 = X_1 + Attention(X_2) and then Y_2 = X_2 + FF(Y_1). It therefore uses two successive sub-layers (Attention and FF) in its forward rule, which is not the case for the architectures considered in Table 2.
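For concreteness, here is a minimal sketch of that two-sub-layer reversible rule, with plain linear maps standing in for `Attention` and `FF` (the linear stand-ins and variable names are illustrative assumptions, not the Reformer implementation). The inverse recovers the inputs exactly without storing activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_attn = rng.standard_normal((d, d))  # stand-in for Attention (assumption)
W_ff = rng.standard_normal((d, d))    # stand-in for FF (assumption)

def attention(x):
    return x @ W_attn

def ff(x):
    return x @ W_ff

def forward(x1, x2):
    # Reformer's reversible rule: two successive sub-layers per block
    y1 = x1 + attention(x2)
    y2 = x2 + ff(y1)
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order
    x2 = y2 - ff(y1)
    x1 = y1 - attention(x2)
    return x1, x2

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```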
For Question 1: we do not actually conduct experiments on Transformers in our paper, although one can define the momentum counterpart of any Transformer.
Thanks for your interesting work!
The Reformer uses RevNet in a clever way. They double the dimension of `x` so that, for `x1, x2 = split(x)`, both `x1` and `x2` have the same dimension as the original `x`. This gives their invertible architecture the "same parameters" as the initial architecture. Let's call this ReformerRevNet.

Question 0. In Table 2, RevNet differs from MomentumNet only in the row "same parameters". I don't see why ReformerRevNet and MomentumNet would be different in Table 2?
Question 1. Is there any reason this ReformerRevNet baseline was not included?
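The dimension-doubling trick described above can be sketched as follows (a sketch under stated assumptions; the names are illustrative, not from the Reformer code). Since each half has the original dimension `d`, each sub-layer's weight matrices stay `d x d`, which is what gives the "same parameters" as the non-reversible architecture:

```python
import numpy as np

d = 8                                        # original model dimension
x = np.random.default_rng(0).standard_normal(2 * d)  # doubled-width state

# split(x): both halves have the same dimension d as the original x
x1, x2 = np.split(x, 2)
assert x1.shape == (d,) and x2.shape == (d,)

# Each sub-layer acts on d-dimensional vectors, so its weights are d x d,
# matching the parameter count of the original (non-reversible) layer.
```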
Apologies for any misunderstanding.