microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Reproducing Figure 1 using 'examples/Transformer/main.py' #69

Open jndean opened 5 months ago

jndean commented 5 months ago

Hi, thank you for maintaining this great repo! We are currently exploring how muP interacts with our unit scaling method, and whether there is a scheme that satisfies both at once.

I have tried to recreate the RHS of your Figure 1 using examples/Transformer/main.py to serve as our baseline. Whilst my results look sensible (a stable optimal learning rate across varying widths, and the expected tick shape), I have been unable to choose hyperparameters that exactly recreate your plot. In particular, my training losses are higher (e.g., at width 128 my minimum training loss is 5.2, whereas yours is around 4.75) and my optimal learning rate is slightly different.

I am using the default arguments from main.py except where they are contradicted by the paper's description of Fig. 1. Could you point me to a description of the training parameters you used for Fig. 1, or highlight which of the settings below might be incorrect? (A sketch of how I am running the sweep follows the table.)

| Param | Value | Reason |
| --- | --- | --- |
| ffn_ratio | 4 | Section 3, pg. 5 |
| epochs | 5 | Section 3, pg. 5 |
| optimizer | 'muadam' | as per Fig. 1 caption |
| norm | postnorm | as per Fig. 18 caption |
| base width | 128 | used by the other transformer experiments in the paper |
| output_mult | 1 | default |
| nlayers | 2 | default |
| nhead | 2 | default |
| batch_size | 20 | default |
| bptt | 35 | default |
| dropout | 0.2 | default |
| etc. | ... | default |
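
For reference, here is a minimal sketch of the (width, learning-rate) sweep I am running. Only `set_base_shapes` and `MuAdam` come from the mup package; `build_model` and `train_one_config` are placeholders for my own wrapper around the examples/Transformer model and training loop, and the width/LR grids are just my choices, not necessarily the ones behind Figure 1:

```python
from mup import MuAdam, set_base_shapes

BASE_WIDTH = 128            # base width, matching the paper's other transformer experiments
WIDTHS = [128, 256, 512, 1024, 2048]    # my guess at the width sweep
LOG2_LRS = range(-14, -4)               # my log2 learning-rate grid

def run_sweep(build_model, train_one_config):
    """build_model(d_model) -> nn.Module; train_one_config(model, optimizer) -> final loss.
    Both are my own helpers wrapping examples/Transformer/main.py, not part of mup."""
    results = {}
    for width in WIDTHS:
        for log2_lr in LOG2_LRS:
            model = build_model(d_model=width)
            # Register base shapes relative to width 128, so muP rescales
            # initialisation and learning rates against that base model.
            base = build_model(d_model=BASE_WIDTH)
            delta = build_model(d_model=BASE_WIDTH * 2)
            set_base_shapes(model, base, delta=delta)
            optimizer = MuAdam(model.parameters(), lr=2.0 ** log2_lr)
            results[(width, log2_lr)] = train_one_config(model, optimizer)
    return results
```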

Thanks very much. My plot is already quite close to yours, but we would prefer to know that our results are directly comparable, so we would like to be able to recreate your figure exactly as the baseline.