microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Reproducing Figure 1 using 'examples/Transformer/main.py' #69

Open jndean opened 5 months ago

jndean commented 5 months ago

Hi, thank you for maintaining this great repo! We are currently exploring how muP interacts with our unit scaling method, and whether there is a scheme that satisfies both at once.

I have tried to recreate the RHS of your Figure 1 using examples/Transformer/main.py to serve as our baseline. Whilst my results look sensible (a stable optimal learning rate across varying widths, and the expected tick shape), I have been unable to choose hyperparameters that exactly recreate your plot. In particular, my training losses are higher (e.g., at width 128 my minimum training loss is 5.2, whereas yours is around 4.75) and my optimal learning rate is slightly different.

I am using the default arguments from main.py except where they are contradicted by the paper's description of Fig. 1. Could you point me to a description of the training parameters you used for Fig. 1, or highlight which of the settings below might be incorrect? (A sketch of how I am running the sweep follows the table.)

| Param | Value | Reason |
| --- | --- | --- |
| ffn_ratio | 4 | Section 3, pg. 5 |
| epochs | 5 | Section 3, pg. 5 |
| optimizer | 'muadam' | as per Fig. 1 caption |
| norm | postnorm | as per Fig. 18 caption |
| base width | 128 | used by the other transformer experiments in the paper |
| output_mult | 1 | default |
| nlayers | 2 | default |
| nhead | 2 | default |
| batch_size | 20 | default |
| bptt | 35 | default |
| dropout | 0.2 | default |
| etc. | ... | default |
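
For reference, here is a minimal sketch of the (width, learning-rate) sweep I am running. Only `set_base_shapes` and `MuAdam` come from the mup package; `build_model` and `train_one_config` are placeholders for my own wrapper around the examples/Transformer model and training loop, and the width/LR grids are just my choices, not necessarily the ones behind Figure 1:

```python
from mup import MuAdam, set_base_shapes

BASE_WIDTH = 128            # base width, matching the paper's other transformer experiments
WIDTHS = [128, 256, 512, 1024, 2048]    # my guess at the width sweep
LOG2_LRS = range(-14, -4)               # my log2 learning-rate grid

def run_sweep(build_model, train_one_config):
    """build_model(d_model) -> nn.Module; train_one_config(model, optimizer) -> final loss.
    Both are my own helpers wrapping examples/Transformer/main.py, not part of mup."""
    results = {}
    for width in WIDTHS:
        for log2_lr in LOG2_LRS:
            model = build_model(d_model=width)
            # Register base shapes relative to width 128, so muP rescales
            # initialisation and learning rates against that base model.
            base = build_model(d_model=BASE_WIDTH)
            delta = build_model(d_model=BASE_WIDTH * 2)
            set_base_shapes(model, base, delta=delta)
            optimizer = MuAdam(model.parameters(), lr=2.0 ** log2_lr)
            results[(width, log2_lr)] = train_one_config(model, optimizer)
    return results
```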

Thanks very much. My plot is already quite close to yours, but we would prefer to know that our results are directly comparable, so we would like to be able to recreate your figure exactly as the baseline.