Closed: zhangliang-04 closed this issue 3 years ago
Hi @zhangliang-04,
For Stage I, setting the `lr` to either `1e-3` or `1e-4` is fine. We tested both in Stage I and they work well, but `1e-4` is more stable, so we used `1e-4` when writing the README.md.

For the batch size, our principle was simply to fill up the GPUs, given our limited resources for pretraining. Sorry that we missed introducing the `gradient_accumulation_steps` of Stage II in the paper; `48` is the forward batch size of each step.

In summary: 1) ignore our batch size, use as much of your GPU memory as possible, and set `gradient_accumulation_steps` to speed training up a little; 2) setting `1e-4` for both stages is fine.
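
In case it helps, here is a minimal, self-contained sketch (not taken from our codebase; the toy model and the accumulation value of 20 are only illustrative, chosen so that 48 × 20 = 960) of how `gradient_accumulation_steps` turns the forward batch into the effective batch:

```python
import torch
import torch.nn as nn

forward_batch_size = 48           # per-step forward batch, as reported in the paper
gradient_accumulation_steps = 20  # hypothetical value; 48 * 20 = 960 (the script's effective batch)

model = nn.Linear(16, 1)          # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(forward_batch_size, 16)   # dummy batch
    y = torch.randn(forward_batch_size, 1)
    loss = loss_fn(model(x), y) / gradient_accumulation_steps  # average over accumulated steps
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                       # one optimizer update per effective batch of 960
        optimizer.zero_grad()
```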
Best,
Thanks for your suggestions!
Hi, I found that the learning rate for pretraining Stage I reported in the paper is `1e-3` with a batch size of `600`, while the scripts in this repo suggest `1e-4` and `1920`; usually the learning rate should be increased along with the batch size. In Stage II, the batch sizes in the paper and the scripts are also very different (`48` vs `960`). Considering that hyper-parameter search takes a lot of time in pretraining, I'm not sure which parameters should be used. Is there any misunderstanding?
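
For reference, the rough linear-scaling calculation I have in mind (just my own assumption, not something stated in the repo or paper) would point the other way:

```python
# If lr 1e-3 pairs with batch size 600 (paper), linearly scaling to the
# script's batch size of 1920 would suggest a larger lr, not a smaller one.
paper_lr, paper_bs = 1e-3, 600
script_bs = 1920
print(paper_lr * script_bs / paper_bs)  # 0.0032, vs. the script's 1e-4
```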