microsoft / StyleSwin

[CVPR 2022] StyleSwin: Transformer-based GAN for High-resolution Image Generation
https://arxiv.org/abs/2112.10762
MIT License

Effect of equalized learning rate in generator architecture #8

Closed · hsi1032 closed this issue 2 years ago

hsi1032 commented 2 years ago

Hi, thanks for this great work!

In the generator code, the mapping network and AdaIN use EqualLinear from StyleGAN2, while the transformer blocks use nn.Linear.

I know this configuration may follow the original implementations of the mapping network and attention block, but I wonder whether this choice affects image generation performance.

For example, the FID when using EqualLinear for the qkv projection in the attention block.
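For reference, here is a minimal sketch of the two layer types I mean, roughly following the StyleGAN2-style EqualLinear with runtime weight rescaling; the exact code in this repository may differ:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class EqualLinear(nn.Module):
    """Linear layer with an equalized learning rate, in the spirit of
    StyleGAN/StyleGAN2: weights are drawn from N(0, 1) and rescaled at
    runtime by sqrt(1 / fan_in), so every parameter sees a similar
    effective update magnitude under Adam."""

    def __init__(self, in_dim, out_dim, lr_mul=1.0, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim).div_(lr_mul))
        self.bias = nn.Parameter(torch.zeros(out_dim)) if bias else None
        self.scale = (1.0 / math.sqrt(in_dim)) * lr_mul
        self.lr_mul = lr_mul

    def forward(self, x):
        bias = self.bias * self.lr_mul if self.bias is not None else None
        return F.linear(x, self.weight * self.scale, bias)


dim = 512
qkv_equal = EqualLinear(dim, dim * 3)  # equalized-lr variant (StyleGAN2 style)
qkv_plain = nn.Linear(dim, dim * 3)    # what the transformer blocks use now
```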

Do you have any idea of the effect of the equalized learning rate in the transformer blocks?

Thanks,

ForeverFancy commented 2 years ago

Hi, thanks for your interest in our work. In our early exploration we tried using EqualLinear in G; however, the training was very unstable and caused the model to collapse, so the FID was much higher than with the final configuration. Therefore, we adopted nn.Linear with careful initialization, and the training became much more stable.

The training and initialization recipes for CNNs and transformers differ greatly (the same holds for discriminative tasks such as classification). The equalized learning rate was proposed to stabilize the training of CNN generators, so it may not be suitable for transformer generators. We therefore think the transformer can only demonstrate its full capability with an appropriate initialization and training approach.
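For illustration, here is a minimal sketch of the kind of transformer-style initialization meant by "careful initialization" above, assuming a truncated-normal scheme as commonly used for Swin-style blocks; the exact values in the released code may differ:

```python
import torch.nn as nn


def init_transformer_linear(m):
    # Illustrative only: truncated-normal weights with a small std and
    # zero bias, as commonly done for Swin-style transformer blocks.
    # The actual initialization used in this repository may differ.
    if isinstance(m, nn.Linear):
        nn.init.trunc_normal_(m.weight, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)


# Example usage: generator.apply(init_transformer_linear)
```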

hsi1032 commented 2 years ago

Your comment makes me think it would be interesting to investigate the training instability of transformers in generative models and its relationship to the learning rate.

Thank you for your kind reply!