Closed: rongcuid closed this issue 3 years ago
I tried my model without RoPE, and it starts predicting nonsense at many timesteps. Positional embedding definitely helps, but I don't know whether it is strictly required or I simply did not train long enough.
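For reference, this is roughly how I inject positions when I do use RoPE: pairs of query/key channels are rotated by a position-dependent angle before attention. This is a minimal sketch, not my actual training code, and the helper name is my own:

```python
import torch

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a (seq_len, dim) tensor.

    Channel pairs (i, i + dim//2) are rotated by angle pos * base**(-2i/dim),
    so relative positions show up in the query-key dot products.
    Assumes dim is even; this is the split-half pairing convention.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-torch.arange(0, half, dtype=x.dtype) * 2 / dim)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys (not values) before the attention step:
# q, k = rotary_embedding(q), rotary_embedding(k)
```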
Hi,
Thanks for the good words about the library.
Positional encoding is still required even though the network can be implemented as an RNN. The state updates are permutation invariant, which means that the order of the past tokens is lost after each update.
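To make that concrete, the recurrent form of causal linear attention boils down to something like the following sketch (plain PyTorch, simplified relative to what the library actually does):

```python
import torch

def phi(x):
    # Feature map from the paper: elu(x) + 1 (any positive map illustrates the point)
    return torch.nn.functional.elu(x) + 1

def causal_linear_attention_rnn(queries, keys, values):
    """Recurrent (RNN) form of causal linear attention.

    queries, keys: (seq_len, d_k); values: (seq_len, d_v).
    The state S and normalizer Z are running sums over past tokens,
    so permuting those tokens leaves the state unchanged.
    """
    d_k, d_v = queries.shape[-1], values.shape[-1]
    S = torch.zeros(d_k, d_v)   # running sum of phi(k_i) v_i^T
    Z = torch.zeros(d_k)        # running sum of phi(k_i)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        S = S + torch.outer(phi(k), v)
        Z = Z + phi(k)
        out = (phi(q) @ S) / (phi(q) @ Z + 1e-6)
        outputs.append(out)
    return torch.stack(outputs)
```

Since S and Z are plain sums, shuffling the past tokens yields exactly the same state, so any order information has to come from the inputs themselves.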
I will close the issue but if you need more information feel free to reopen it.
Cheers, Angelos
Thank you. My experiments also show that the model cannot learn sequence generation when positional encoding is not used.
Just noting: I did some surveying and found https://arxiv.org/pdf/1905.04226.pdf, which claims that autoregressive transformers do not require positional encoding. I do not have the computational capacity to try such deep models, but I suppose one can leave the encoding out if the network is deep enough.
In my particular model, training without positional encoding gives bad results, but mine is only 4 heads / 4 layers with a hidden size of 128 and a feed-forward size of 1024. It might simply be that the model is too small.
For future reference, I can confirm that positional encoding is not required for a causal model. However, leaving it out requires a lot more tuning, especially in my case where multi-task learning is involved. With positional encoding applied to the decoder, the model converges much faster and more stably.
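In case it helps anyone else, the encoding I add to the decoder inputs is just the standard sinusoidal one from "Attention Is All You Need". Roughly (module name and default sizes are my own choices, not something from the library):

```python
import math
import torch
from torch import nn

class SinusoidalPositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding added to the token embeddings."""

    def __init__(self, d_model=128, max_len=4096):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[: x.size(1)]
```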
Great papers, and thank you for the library. I have successfully reproduced the "Transformers are RNNs" paper from scratch. However, I have some questions about the use of positional encoding.
Thank you for your great work.