I noticed that the weight init for the MLP and attention heads is normalized by n_embd (fan-in), which differs from the original nanoGPT code: https://github.com/tml-epfl/why-weight-decay/blob/main/large_language_models/model.py#L162
But this piece of code isn't actually used in the experiments, right? custom_init seems to be turned off in all the config files.
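For reference, here is a minimal sketch of what I understand the fan-in-normalized init to look like, as opposed to the usual nanoGPT-style init with a fixed std of 0.02 (the function name and the custom_init flag are just illustrative, not the repo's exact code):

```python
import math
import torch.nn as nn

def init_weights(module, n_embd, custom_init=False):
    """Sketch of the two init schemes.

    custom_init=False: nanoGPT-style init with std 0.02.
    custom_init=True: normalize by fan-in, i.e. std = 1/sqrt(n_embd)
    for the MLP / attention projection weights.
    """
    if isinstance(module, nn.Linear):
        std = 1.0 / math.sqrt(n_embd) if custom_init else 0.02
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
```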
Right, it wasn't used in the experiments. I added this piece of code for some internal experiments that we didn't report in the paper.
Thanks!