Closed: zhangliang-04 closed this issue 3 years ago
Hi @zhangliang-04,
For Stage I, setting the `lr` to either `1e-3` or `1e-4` is fine. We tested both in Stage I and they work well, but `1e-4` is more stable, so we used `1e-4` when writing the README.md.

For the batch size, our principle was simply to fill up the GPUs, given our limited resources for pretraining. Sorry that we missed introducing the `gradient_accumulation_steps` of Stage II in the paper; `48` is the forward batch size of each step.

In summary: 1) ignore our batch size, use as much of your GPU memory as possible, and set `gradient_accumulation_steps` to speed training up a little; 2) setting `1e-4` for both stages is fine.
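
In case it helps, here is a minimal, self-contained sketch (not taken from our codebase; the toy model and the accumulation value of 20 are only illustrative, chosen so that 48 × 20 = 960) of how `gradient_accumulation_steps` turns the forward batch into the effective batch:

```python
import torch
import torch.nn as nn

forward_batch_size = 48           # per-step forward batch, as reported in the paper
gradient_accumulation_steps = 20  # hypothetical value; 48 * 20 = 960 (the script's effective batch)

model = nn.Linear(16, 1)          # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(forward_batch_size, 16)   # dummy batch
    y = torch.randn(forward_batch_size, 1)
    loss = loss_fn(model(x), y) / gradient_accumulation_steps  # average over accumulated steps
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                       # one optimizer update per effective batch of 960
        optimizer.zero_grad()
```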
Best,
Thanks for your suggestions!
Hi, I found that the learning rate for pretraining Stage I reported in the paper is `1e-3` with a batch size of `600`, while the scripts in this repo suggest `1e-4` and `1920`; usually the learning rate should be increased along with the batch size. In Stage II, the batch sizes in the paper and the scripts are also very different (`48` vs `960`). Considering that hyper-parameter search takes a lot of time in pretraining, I'm not sure which parameters should be used. Is there any misunderstanding?
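
For reference, the rough linear-scaling calculation I have in mind (just my own assumption, not something stated in the repo or paper) would point the other way:

```python
# If lr 1e-3 pairs with batch size 600 (paper), linearly scaling to the
# script's batch size of 1920 would suggest a larger lr, not a smaller one.
paper_lr, paper_bs = 1e-3, 600
script_bs = 1920
print(paper_lr * script_bs / paper_bs)  # 0.0032, vs. the script's 1e-4
```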