karpathy / build-nanogpt

Video+code lecture on building nanoGPT from scratch

Embeddings are initialized with std of 0.02 #18

Open eryk-mazus opened 3 months ago

eryk-mazus commented 3 months ago

I noticed that in the following snippet the std of nn.Embedding is set to 0.02:

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                # residual-path projections are scaled down by 1/sqrt(2 * n_layer)
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

The official implementation sets it to 0.01, as noted in the video. It only matters for the positional embeddings, due to the weight-sharing scheme between wte and lm_head.
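
For context, a minimal sketch of that point (the dimensions below are hypothetical GPT-2-style values, not taken from the repo): because wte shares its weight tensor with lm_head, which is an nn.Linear, the shared tensor ends up initialized with std 0.02 by the Linear branch anyway, so the Embedding std choice only independently affects wpe.

    import torch
    import torch.nn as nn

    # hypothetical GPT-2-ish sizes, for illustration only
    n_embd, vocab_size, block_size = 768, 50257, 1024

    wte = nn.Embedding(vocab_size, n_embd)               # token embeddings
    wpe = nn.Embedding(block_size, n_embd)                # positional embeddings
    lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    # weight sharing scheme: wte and lm_head point at the same parameter
    wte.weight = lm_head.weight

    # _init_weights reaches this shared tensor through both branches,
    # and both use std=0.02, so the Embedding std is moot for wte
    torch.nn.init.normal_(lm_head.weight, mean=0.0, std=0.02)  # Linear branch
    torch.nn.init.normal_(wte.weight, mean=0.0, std=0.02)      # Embedding branch, same tensor

    # only here would choosing 0.01 vs 0.02 actually change anything
    torch.nn.init.normal_(wpe.weight, mean=0.0, std=0.02)

    print(wte.weight is lm_head.weight)   # True: one shared parameter
    print(wte.weight.std().item(), wpe.weight.std().item())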

peter-ni-noob commented 3 months ago

0.02 is OK according to the Megatron-LM training loop.

melqtx commented 3 months ago

> The official implementation sets it to 0.01, as noted in the video. It only matters for the positional embeddings, due to the weight-sharing scheme between wte and lm_head.

It doesn't matter much; 0.02 works fine.