lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

Sinusoidal embedding order choice different from original definition #252

Closed · gordicaleksa closed 2 months ago

gordicaleksa commented 2 months ago

Hey Phil! One more from me! :)

I see that the way you stack sinusoidal embeddings here is different from the original transformer paper (section 3.5).

Instead of your version: `emb = torch.cat((emb.sin(), emb.cos()), dim = -1)`

the original does: `emb = torch.stack((emb.sin(), emb.cos()), dim=-1).view(max_pos, -1)`

i.e. your vector looks like `[sin(x1), sin(x2), ..., cos(x1), cos(x2), ...]`, whereas in the original paper it is `[sin(x1), cos(x1), sin(x2), cos(x2), ...]`.

Again, the network will certainly learn with either of these; I was just curious whether there is any empirical finding showing that the first definition is more performant.
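
For concreteness, here is a minimal sketch (illustrative names and sizes, not the library's code; the frequency formula follows the paper) showing that the two layouts contain the same values, just permuted along the feature dimension:

```python
import torch

# illustrative sizes, not taken from the library
max_pos, dim = 16, 8
half_dim = dim // 2

# frequencies as in the paper: 1 / 10000^(2i / dim) for i = 0 .. half_dim - 1
freqs = torch.exp(-torch.arange(half_dim).float() * (torch.log(torch.tensor(10000.)) / half_dim))
pos = torch.arange(max_pos).float()
emb = pos[:, None] * freqs[None, :]  # (max_pos, half_dim)

# x-transformers layout: all sines first, then all cosines
# [sin(x1), sin(x2), ..., cos(x1), cos(x2), ...]
emb_cat = torch.cat((emb.sin(), emb.cos()), dim=-1)

# original paper layout: interleaved sin / cos
# [sin(x1), cos(x1), sin(x2), cos(x2), ...]
emb_interleaved = torch.stack((emb.sin(), emb.cos()), dim=-1).view(max_pos, -1)

# same values either way, just a fixed permutation of the feature dimension
perm = torch.cat((torch.arange(half_dim) * 2, torch.arange(half_dim) * 2 + 1))
assert torch.allclose(emb_cat, emb_interleaved[:, perm])
```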

lucidrains commented 2 months ago

hmm not that i know of, but if you run any benchmarks, let me know