tml-epfl / why-weight-decay

Why Do We Need Weight Decay in Modern Deep Learning? [arXiv, Oct 2023]
https://arxiv.org/abs/2310.04415

MuP/Fan-in initialization is used in GPT code #1

Closed: xidulu closed this issue 2 hours ago

xidulu commented 2 hours ago

I noticed that the weight init for the MLP and attention heads is normalized by n_embd (fan-in), which differs from the original NanoGPT code: https://github.com/tml-epfl/why-weight-decay/blob/main/large_language_models/model.py#L162

But this piece of code isn't actually used in the experiments, right? `custom_init` seems to always be turned off in all the config files.
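For context, here is a minimal sketch of the kind of fan-in (muP-style) initialization being discussed, gated by a `custom_init` flag. The function and parameter names are hypothetical, not the repo's exact model.py code:

```python
import math
import torch.nn as nn


def init_linear(module: nn.Linear, custom_init: bool, base_std: float = 0.02) -> None:
    """Initialize a Linear layer's weights (hypothetical illustration).

    If custom_init is True, use a fan-in scaled std (muP-style);
    otherwise fall back to the fixed NanoGPT-style std.
    """
    if custom_init:
        # Fan-in scaling: std shrinks as 1/sqrt(in_features) (e.g. n_embd),
        # keeping activation scale roughly constant as width grows.
        std = 1.0 / math.sqrt(module.in_features)
    else:
        # Default: fixed std independent of width.
        std = base_std
    nn.init.normal_(module.weight, mean=0.0, std=std)
    if module.bias is not None:
        nn.init.zeros_(module.bias)
```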

max-andr commented 2 hours ago

Right, it wasn't used in the experiments. I added this piece of code for some internal experiments that we didn't report in the paper.

xidulu commented 2 hours ago

Thanks!