microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

ResNet readout_zero_init=True? #13

Closed: D-X-Y closed this issue 2 years ago

D-X-Y commented 2 years ago

Dear Greg,

Awesome project! May I ask why the linear output layer in ResNet is initialized to 0 instead of Gaussian(mean=0, var=1), as mentioned in the paper?

Thanks a lot for your time and help.

thegregyang commented 2 years ago

Thanks @D-X-Y!

We discuss this in the paper in Section D.2.


Let us know if you have any questions after reading this.

D-X-Y commented 2 years ago

Thanks a lot for your explanation!
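
For anyone comparing the two initializations asked about above, here is a minimal sketch in plain Python (not the mup code). It assumes a single linear readout whose output is multiplied by 1/width, and random hidden features with entries of order 1; the function name `readout_output` and the feature distribution are hypothetical, chosen only for illustration. It shows that a Gaussian(mean=0, var=1) readout gives a nonzero output at initialization that shrinks as width grows, while a zero-initialized readout gives exactly 0 at every width.

```python
import random


def readout_output(width, zero_init, seed=0):
    """Output at initialization of one linear readout over random features.

    Assumption: the readout output is scaled by 1/width, and the hidden
    features are i.i.d. standard Gaussian stand-ins for real activations.
    """
    rng = random.Random(seed)
    # Hypothetical hidden activations with entries of order 1.
    hidden = [rng.gauss(0.0, 1.0) for _ in range(width)]
    if zero_init:
        # Zero-initialized readout: every weight starts at 0.
        weights = [0.0] * width
    else:
        # Gaussian(mean=0, var=1) readout weights.
        weights = [rng.gauss(0.0, 1.0) for _ in range(width)]
    # Dot product of weights and features, scaled by the assumed 1/width.
    return sum(w * h for w, h in zip(weights, hidden)) / width


if __name__ == "__main__":
    for width in (64, 1024, 16384):
        gaussian = readout_output(width, zero_init=False)
        zero = readout_output(width, zero_init=True)
        print(f"width={width:6d}  gaussian init: {gaussian:+.4f}  zero init: {zero:+.4f}")
```

With the Gaussian readout, the typical size of the initial output in this sketch is about 1/sqrt(width), so narrow and wide models start from different functions. With the zero readout, they all start from the same one. Whether this is the full reasoning in Section D.2 should be checked against the paper itself.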