eryk-mazus opened this issue 3 months ago
I noticed that in the following snippet, the `std` of `nn.Embedding` is set to `0.02`:

```python
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # residual-branch projections are scaled down by 1/sqrt(2 * n_layer)
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
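For context on the weight sharing mentioned below: the token embedding and the output head share a single tensor. A minimal standalone sketch of that tying, with hypothetical GPT-2-small sizes (the direction of assignment follows what the video shows, `wte.weight = lm_head.weight`):

```python
import torch.nn as nn

n_embd, vocab_size = 768, 50257              # hypothetical GPT-2-small sizes
wte = nn.Embedding(vocab_size, n_embd)       # token embedding
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# weight sharing: both modules now point at one tensor,
# so initializing one initializes the other
wte.weight = lm_head.weight
assert wte.weight.data_ptr() == lm_head.weight.data_ptr()
```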
The official implementation sets it to `0.01` for positional embeddings, as noted in the video. It only matters for the positional embeddings: because of the weight sharing scheme, `wte.weight` is the same tensor as `lm_head.weight`, which the `nn.Linear` branch already initializes with `std=0.02`, so the `nn.Embedding` branch effectively only controls the init of `wpe`.
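One way to match the official numbers would be to override the positional table after the generic `self.apply(self._init_weights)` pass. This is a sketch, not the repo's code, and it assumes the positional embedding is reachable as `self.transformer.wpe` (the GPT-2 naming used in the repo):

```python
# inside GPT.__init__, after self.apply(self._init_weights):
# re-initialize only the positional table to the official GPT-2 value;
# wte deliberately stays at std=0.02 since it shares its tensor with lm_head
torch.nn.init.normal_(self.transformer.wpe.weight, mean=0.0, std=0.01)
```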
0.02 is ok according to the Megatron-LM training loop.

It doesn't matter much; 0.02 kind of works.