lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

Sinusoidal embedding order choice different from original definition #252

Closed · gordicaleksa closed 2 months ago

gordicaleksa commented 2 months ago

Hey Phil! One more from me! :)

I see that the way you stack sinusoidal embeddings here is different from the original transformer paper (section 3.5).

Instead of your version: `emb = torch.cat((emb.sin(), emb.cos()), dim = -1)`

the original does: `emb = torch.stack((emb.sin(), emb.cos()), dim=-1).view(max_pos, -1)`

i.e. your vector looks like `[sin(x1), sin(x2), ..., cos(x1), cos(x2), ...]`, whereas in the original paper it is `[sin(x1), cos(x1), sin(x2), cos(x2), ...]`.

Again, the network will certainly learn with either of these; I was just curious whether there is any empirical finding showing that the first definition is more performant.
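
For concreteness, here is a minimal sketch (illustrative names and sizes, not the library's code; the frequency formula follows the paper) showing that the two layouts contain the same values, just permuted along the feature dimension:

```python
import torch

# illustrative sizes, not taken from the library
max_pos, dim = 16, 8
half_dim = dim // 2

# frequencies as in the paper: 1 / 10000^(2i / dim) for i = 0 .. half_dim - 1
freqs = torch.exp(-torch.arange(half_dim).float() * (torch.log(torch.tensor(10000.)) / half_dim))
pos = torch.arange(max_pos).float()
emb = pos[:, None] * freqs[None, :]  # (max_pos, half_dim)

# x-transformers layout: all sines first, then all cosines
# [sin(x1), sin(x2), ..., cos(x1), cos(x2), ...]
emb_cat = torch.cat((emb.sin(), emb.cos()), dim=-1)

# original paper layout: interleaved sin / cos
# [sin(x1), cos(x1), sin(x2), cos(x2), ...]
emb_interleaved = torch.stack((emb.sin(), emb.cos()), dim=-1).view(max_pos, -1)

# same values either way, just a fixed permutation of the feature dimension
perm = torch.cat((torch.arange(half_dim) * 2, torch.arange(half_dim) * 2 + 1))
assert torch.allclose(emb_cat, emb_interleaved[:, perm])
```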

lucidrains commented 2 months ago

hmm not that i know of, but if you run any benchmarks, let me know