idiap / fast-transformers

Pytorch library for fast transformer implementations

For recurrent models, are positional embeddings required? #102

Closed. rongcuid closed this issue 3 years ago.

rongcuid commented 3 years ago

Great papers, and thank you for the library. I have successfully reproduced the "Transformers are RNNs" paper from scratch. However, I have some questions about the use of positional encoding.

  1. The causal linear attention is implemented as a recurrent model. Does this imply that positional embeddings are not required, as with "normal" RNN models? (A sketch of the recurrent formulation I have in mind follows this list.)
  2. For non-causal attention (such as an encoder that sees the entire sequence), I am using Rotary Positional Embedding. Do you have any comments on that? Would it be better to use causal attention on the encoder as well?
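
To be concrete about question 1, here is a minimal sketch of the recurrent formulation I mean, written from the "Transformers are RNNs" description rather than the library's code (the elu+1 feature map and the shapes are my assumptions):

```python
import torch

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, the feature map used in the linear attention paper
    return torch.nn.functional.elu(x) + 1

def causal_linear_attention_recurrent(queries, keys, values, eps=1e-6):
    """Step through the sequence keeping only a running state (batch dim omitted).

    queries, keys: (seq_len, d_k); values: (seq_len, d_v).
    """
    d_k, d_v = queries.shape[-1], values.shape[-1]
    s = torch.zeros(d_k, d_v)   # running sum of phi(k_t) v_t^T
    z = torch.zeros(d_k)        # running sum of phi(k_t), for normalization
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
        s = s + torch.outer(phi_k, v)   # additive state update
        z = z + phi_k
        outputs.append((phi_q @ s) / (phi_q @ z + eps))
    return torch.stack(outputs)
```

The state (s, z) only accumulates sums over the steps seen so far, which is what prompted the question about whether positional information is still needed.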

Thank you for your great work.

rongcuid commented 3 years ago

I tried my model without RoPE, and it starts predicting nonsense at many timesteps. Positional embedding definitely helps, but I don't know whether it is required or whether I simply did not train for long enough.

angeloskath commented 3 years ago

Hi,

Thanks for the good words about the library.

Positional encoding is still required even though the network can be implemented as an RNN. The state updates are permutation invariant, which means that the order is lost after each prediction.
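
As a small illustration (not code from the library), the state accumulated by linear attention is a plain sum over time steps, so feeding the same keys and values in a different order yields the same state:

```python
import torch

def elu_feature_map(x):
    return torch.nn.functional.elu(x) + 1

def accumulate_state(keys, values):
    # The linear-attention state is a sum over time steps of phi(k_t) v_t^T,
    # so it does not depend on the order in which the steps arrive.
    s = torch.zeros(keys.shape[-1], values.shape[-1])
    for k, v in zip(keys, values):
        s = s + torch.outer(elu_feature_map(k), v)
    return s

keys, values = torch.randn(5, 8), torch.randn(5, 16)
perm = torch.randperm(5)
same = torch.allclose(accumulate_state(keys, values),
                      accumulate_state(keys[perm], values[perm]))
print(same)  # True (up to floating-point error): the state carries no order information
```

Without positional encoding the model therefore has no way to distinguish the two orderings.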

I will close the issue but if you need more information feel free to reopen it.

Cheers, Angelos

rongcuid commented 3 years ago

Thank you. My experiments also show that the model cannot learn sequence generation when positional encoding is not used.

rongcuid commented 3 years ago

Just a note: I did some surveying and found https://arxiv.org/pdf/1905.04226.pdf, which claims that autoregressive transformers do not require positional encoding. I do not have the computational capacity to try such deep models, but I suppose one can leave positional encodings out if the network is deep enough.

In my particular model, training without positional encoding gives bad results, but my model is only 4 heads / 4 layers with a hidden size of 128 and a feed-forward size of 1024. It might simply be that the model is too small.

rongcuid commented 3 years ago

For future reference, I can confirm that positional encoding is not required for a causal model. However, leaving it out requires a lot more tuning, especially in my case, where multi-task learning is involved. With positional encoding applied to the decoder, the model converges much faster and more stably.
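
For anyone reading this later, by "positional encoding applied to the decoder" I just mean adding position information to the decoder inputs before the attention layers. A minimal sketch with standard sinusoidal encodings (my own illustration, not this library's API; names and sizes are placeholders):

```python
import math
import torch

class SinusoidalPositionalEncoding(torch.nn.Module):
    """Standard sinusoidal positional encoding added to the token embeddings."""

    def __init__(self, d_model, max_len=2048):
        super().__init__()
        # d_model is assumed to be even
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[: x.shape[1]].unsqueeze(0)
```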