jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks
https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/
MIT License

Verify Initialization of Transformer Model Components is good/appropriate. #65

Closed. jbloomAus closed this issue 1 year ago.

jbloomAus commented 1 year ago

I think I haven't adequately looked into this, so I will do so briefly.

If I have reason to think a change will improve my model, I'll make it.

Edit: adding a task to this card to optimize weight decay. We'll parameterize this.

jbloomAus commented 1 year ago

What Neel says in the T-lens docstring for HookedTransformer.init_weights()

        Initialize weights matrices with a normal of std=initializer_range (default=0.02). This roughly follows the GPT-2 paper's scheme (but with truncation, and not halving the std for W_pos).

        LayerNorm weights are already initialized to 1.0, and all biases are initialized to 0.0 (including LayerNorm), so this just initializes weight matrices.

        Weight matrices are set to empty by default (to save space + compute, since they're the bulk of the parameters), so it is important to call this if you are not loading in pretrained weights! Note that this function assumes that weight names begin with W_

        Set seed here to ensure determinism.

        This does NOT follow the PyTorch scheme, which as far as I can tell is super out of date but no one has gotten round to updating it?
        https://github.com/pytorch/pytorch/issues/18182

        PyTorch Transformers are especially bad - TransformerEncoder initializes all layers to the exact same weights?! https://github.com/pytorch/pytorch/issues/72253

        The best paper I've found on transformer initialization is the muP paper, but haven't integrated those ideas yet: https://arxiv.org/abs/2203.03466
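
For concreteness, a minimal sketch of that scheme applied to a generic nn.Module (an illustration, not TransformerLens's actual init_weights, which works on its own W_-prefixed parameters):

    import torch.nn as nn

    def init_weights_gpt2_style(model: nn.Module, std: float = 0.02) -> None:
        # Weight matrices get a truncated normal with small std, biases get zeros;
        # 1-d "weight" parameters (LayerNorm) are left at their default of 1.0.
        for name, param in model.named_parameters():
            if param.dim() >= 2 and name.endswith("weight"):
                nn.init.trunc_normal_(param, mean=0.0, std=std)
            elif name.endswith("bias"):
                nn.init.zeros_(param)
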
jbloomAus commented 1 year ago

Related idea: plot L2 norms of the residual stream during training, or check for saturation, and look at how these shift as training progresses...
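
A rough sketch of how that could be logged with TransformerLens, assuming model is a HookedTransformer and tokens is a batch of token ids:

    def residual_stream_norms(model, tokens):
        # Cache all activations for one forward pass and report the mean L2 norm
        # of the residual stream after each block.
        _, cache = model.run_with_cache(tokens)
        return {
            layer: cache["resid_post", layer].norm(dim=-1).mean().item()
            for layer in range(model.cfg.n_layers)
        }
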

jbloomAus commented 1 year ago

I've decided I want to emulate MinGPT / Othello GPT, as this seems most reasonable. There are some subtleties I've (perhaps only theoretically) been getting wrong that I want to rectify.

What is the initialization strategy we want? here.

What is the decay strategy we want? here.

(I'll make an arg to toggle this so we can compare against the old strategy.)

Since my naive embedding strategy uses a linear layer to embed the observation, I will need to override this to stay true to MinGPT's strategy.

I'll quickly work out why he's doing things this way.
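
Roughly what the MinGPT-style init looks like (a sketch assuming the model is built from nn.Linear / nn.Embedding / nn.LayerNorm modules and that residual output projections are named c_proj as in MinGPT; the observation-embedding linear layer falls into the first branch unless it is handled separately):

    import math
    import torch.nn as nn

    def init_weights_mingpt_style(model: nn.Module, n_layer: int) -> None:
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Embedding)):
                nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if isinstance(module, nn.Linear) and module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.LayerNorm):
                nn.init.ones_(module.weight)
                nn.init.zeros_(module.bias)
        # MinGPT additionally shrinks the residual-stream output projections
        # by 1/sqrt(2 * n_layer), following the GPT-2 paper.
        for name, param in model.named_parameters():
            if name.endswith("c_proj.weight"):
                nn.init.normal_(param, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
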

jbloomAus commented 1 year ago

[Image: distribution of parameter log-stds after initialization, with a horizontal reference line]

Old init looked very messy; this is better. (The horizontal reference line for the std was at -log(0.02) rather than -ln(0.02), which I've fixed.)

jbloomAus commented 1 year ago

I don't think we need this plot in wandb, but I'll add it to the GitHub repo anyway.

jbloomAus commented 1 year ago

Adding weight decay groups. This might break reporting for some arg combinations, but I'll deal with that as it comes up.
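
A minimal sketch of MinGPT-style decay groups (illustrative defaults; decay applies to Linear weights only, not to biases, LayerNorm parameters, or embeddings; tied/shared parameters aren't handled here):

    import torch
    import torch.nn as nn

    def make_optimizer(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.1):
        decay, no_decay = [], []
        for module in model.modules():
            for name, param in module.named_parameters(recurse=False):
                if isinstance(module, nn.Linear) and name == "weight":
                    decay.append(param)
                else:
                    # biases, LayerNorm weights/biases, embedding tables
                    no_decay.append(param)
        groups = [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ]
        return torch.optim.AdamW(groups, lr=lr)
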