Thanks for sharing this awesome work. Could you please also provide the batch size, the hyper-parameters for the optimizer and the decay steps of the cosine learning rate?
The batch size is 256. I used PyTorch's default SGD hyper-parameters (momentum=0.9). As for the learning rate, I used a half-cycle cosine decay, which has no discrete decay steps.
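For reference, a half-cycle cosine decay can be sketched as below. This is just an illustrative formula, not the author's code; `base_lr` and `total_epochs` are placeholder values (PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` produces the same shape).

```python
import math

def half_cosine_lr(base_lr, epoch, total_epochs):
    """Half-cycle cosine decay: the LR follows half a cosine wave,
    starting at base_lr and decaying smoothly toward 0 at the final
    epoch, with no discrete decay steps."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Illustrative values (not from the thread): base_lr=0.1, 100 epochs.
print(half_cosine_lr(0.1, 0, 100))    # full LR at the start
print(half_cosine_lr(0.1, 50, 100))   # half the LR at the midpoint
print(half_cosine_lr(0.1, 100, 100))  # decays to 0 at the end
```

In a training loop this would be evaluated once per epoch (or per step) to set the optimizer's learning rate.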