jiaohuix closed this issue 1 year ago
Hi @MiuGod0126
$\beta$ is a multiplier, so it should be:
linear2_weight = linear2.weight.detach().numpy().reshape((-1, )) * init_scale
instead of
linear2_weight = linear2.weight.detach().numpy().reshape((-1, )) / init_scale
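A quick way to see why the multiplier is right: Xavier-normal with gain $g$ draws from $N(0, (g\cdot\sqrt{2/(fan_{in}+fan_{out})})^2)$, so initializing with `gain=beta` is the same as initializing with `gain=1` and then multiplying by `beta`; dividing scales the std the wrong way. A minimal NumPy sketch (not torchscale's actual code; the `beta` value and layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512
beta = 0.87  # hypothetical DeepNorm beta, for illustration only

# Xavier-normal std for gain g is g * sqrt(2 / (fan_in + fan_out))
base_std = np.sqrt(2.0 / (fan_in + fan_out))

# Paper's form: draw directly with gain = beta
w_paper = rng.normal(0.0, beta * base_std, size=(fan_out, fan_in))

# Equivalent form: draw with gain = 1, then multiply by beta
w_scaled = rng.normal(0.0, base_std, size=(fan_out, fan_in)) * beta

# Both empirical stds should match beta * base_std
print(w_paper.std(), w_scaled.std(), beta * base_std)
```

Dividing by `beta` instead would give std `base_std / beta`, which is larger than the target whenever `beta < 1`.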
@shumingma Ooooh! Sorry, I carelessly read the multiplication as a division. Thank you for the correction!!! I now understand deepnorm_init more deeply, and the corrected distribution is as follows:
I have a doubt about deepnorm. In the paper, the deepnorm_init function uses `xavier_normal_(x, gain=beta)` for "ffn", "v_proj", and "out_proj". However, the torchscale source code uses `xavier_normal_(x, gain=1) / beta`:
(code snippet omitted)

Although I know that X ~ N(0, std^2) implies aX ~ N(0, (a*std)^2), I plotted the distributions produced by both methods as histograms, and the results show some differences between the two methods:
(histograms not shown)
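The histogram differences can be reproduced numerically: multiplying by beta and dividing by beta produce Gaussians whose stds differ by a factor of beta^2, which is why the two plots do not overlap. A minimal NumPy sketch under the same assumptions (illustrative `beta`, not torchscale's code):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = fan_out = 1024
beta = 0.87  # hypothetical beta, for illustration only
base_std = np.sqrt(2.0 / (fan_in + fan_out))

n = fan_in * fan_out
w_mul_beta = rng.normal(0.0, base_std, size=n) * beta  # paper-equivalent
w_div_beta = rng.normal(0.0, base_std, size=n) / beta  # the mistaken reading

# The division variant is wider by a factor of 1/beta**2
print("mul std:", w_mul_beta.std())
print("div std:", w_div_beta.std())
print("ratio (expect 1/beta**2):", w_div_beta.std() / w_mul_beta.std())
```

For `beta < 1` the divided weights are strictly wider, so the two histograms visibly differ, matching the observation above.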
Is my implementation wrong? Which method should I use? I hope someone can enlighten me, thank you!!!