gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Question on multiplying linear projection by sqrt(d_model) #3

Closed · evanatyourservice closed this issue 2 years ago

evanatyourservice commented 3 years ago

Hello! Thank you for the great paper and for sharing your implementation! I have a quick question: I'm wondering why the linear projection is multiplied by the constant square root of self.d_model, since this isn't mentioned in the paper and I don't think I've seen it in other implementations.

This line:

https://github.com/gzerveas/mvts_transformer/blob/fe3b539ccc2162f55cf7196c8edc7b46b41e7267/src/models/ts_transformer.py#L299

Just curious, thank you!

gzerveas commented 3 years ago

Hi, thanks for the perceptive question. The idea behind that scaling factor was to keep the magnitude of the (projected) input vectors (more exactly, their variance) within the same range as the positional encodings. Without this scaling factor, the magnitude of the projections would grow with the dimensionality d_model (consider that we start with some uniform distribution of projection weights, and the input vectors are anyway normalized), and this means that the positional encodings, which are simply added on top, might end up negligible compared to the projections.

Admittedly, this is more important when using sinusoidal encodings, while the learnable encodings could be suitably adjusted by the model. Still, the learnable encodings are also initialized by Xavier, so the ranges match and this way it may be a bit easier for the model to learn them.

Honestly, I am far from certain that this indeed has an observable effect on the results; I didn't systematically evaluate its effect (faster convergence) beyond some initial observations, and I don't think it would be worth mentioning and justifying in the paper, given the space limitations. But I think that it is at least harmless, so I left it in the code :) If you can experiment with both using and omitting this factor, let us know what you observe - my guess, not much :)
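In code, the idea is roughly the following. This is just a simplified sketch, not the exact module from ts_transformer.py (the class and argument names here are made up), using the standard sinusoidal positional encodings:

```python
import math
import torch
import torch.nn as nn

class ScaledInputProjection(nn.Module):
    """Project feature vectors to d_model and add sinusoidal positional encodings.

    The projection is multiplied by sqrt(d_model) so that its magnitude stays
    comparable to the positional encodings that are added on top.
    """
    def __init__(self, feat_dim, d_model, max_len=1024):
        super().__init__()
        self.d_model = d_model
        self.project_inp = nn.Linear(feat_dim, d_model)
        # Fixed sinusoidal positional encodings, as in "Attention Is All You Need"
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, feat_dim)
        inp = self.project_inp(x) * math.sqrt(self.d_model)  # the scaling in question
        return inp + self.pe[:x.size(1)]                      # positional encodings added on top
```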

evanatyourservice commented 2 years ago

Thank you so much for your answer! That certainly makes sense, and it's a clever detail to have thought of. I wrote an implementation in TF and skipped the scaling, but I'll add it in now; like you said, it can't hurt.

So far I've been getting very good results. I've been initializing everything with Kaiming uniform (and training with the NovoGrad optimizer), but I've noticed the results can vary drastically between random seeds, so I've started looking into different ways to initialize networks, specifically transformers, to get more consistent results. It looks like PyTorch initializes MultiheadAttention with Xavier uniform and its linear layers with Kaiming uniform, so I think I'll try that setup next. Is this what you use? I think PyTorch switched their linear layers from Xavier to Kaiming not long ago.
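Something like this is what I have in mind; just a sketch (the helper name is mine), and note that MultiheadAttention's fused in_proj_weight isn't an nn.Linear, so it keeps PyTorch's default Xavier uniform init:

```python
import torch
import torch.nn as nn

def xavier_init_linear_layers(model, seed=0):
    """Re-initialize every nn.Linear in the model with Xavier uniform weights and zero biases."""
    torch.manual_seed(seed)  # pin the seed so different init schemes are comparable
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

# e.g. on a small encoder stack:
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, dim_feedforward=256)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)
xavier_init_linear_layers(encoder)
```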

I found the T-Fixup paper, which initializes with Xavier and then shrinks the magnitudes by a constant based on the number of transformer layers (though it's aimed at seq2seq transformers), and GradInit, which uses an optimizer to adjust the variance of each layer's initialization so as to minimize the loss after a single training step. GradInit has some impressive results; I might implement it. A rough sketch of the T-Fixup idea is below.
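This is not a faithful reimplementation: the 0.67 * N^(-1/4) factor and the exact set of weights to scale should be double-checked against the paper, and I'm skipping the embedding-scaling part entirely:

```python
import torch
import torch.nn as nn

def tfixup_like_scale(encoder: nn.TransformerEncoder, num_layers: int) -> None:
    """Shrink selected (already Xavier-initialized) weights by a layer-count-dependent factor."""
    scale = 0.67 * num_layers ** -0.25  # encoder-only factor as I read the T-Fixup paper
    with torch.no_grad():
        for layer in encoder.layers:
            d = layer.self_attn.embed_dim
            layer.self_attn.in_proj_weight[2 * d:].mul_(scale)  # value-projection block of the fused QKV weight
            layer.self_attn.out_proj.weight.mul_(scale)         # attention output projection
            layer.linear1.weight.mul_(scale)                    # feed-forward weights
            layer.linear2.weight.mul_(scale)

# (assumes the Xavier re-init from the previous sketch has been applied first)
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, dim_feedforward=256)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)
tfixup_like_scale(encoder, num_layers=3)
```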

Do you have any insight into initializing MTST?