tonysy opened 2 years ago
Hi, we follow the setting in DeiT, where learning rate = 5e-4 * batch_size / 512. The batch size in our code is 256 per GPU, so the total batch size is 2048 with lr 2e-3. The learning rate and batch size in the paper are 1e-3 and 1024, so the two configurations are nearly equivalent under this scaling rule.
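To make the equivalence concrete, here is a minimal sketch of the linear scaling rule described above. The 8-GPU count is an assumption (256 per GPU with a total of 2048 implies 8 GPUs); `scaled_lr` is a hypothetical helper, not a function from the repo:

```python
# DeiT linear scaling rule: lr = 5e-4 * total_batch_size / 512.
def scaled_lr(batch_per_gpu: int, num_gpus: int, base_lr: float = 5e-4) -> float:
    total_batch = batch_per_gpu * num_gpus
    return base_lr * total_batch / 512

# Code setting: 256 per GPU, assuming 8 GPUs -> total batch 2048.
print(scaled_lr(256, 8))  # 0.002, i.e. lr 2e-3

# Paper setting: total batch 1024 (e.g. 128 per GPU on 8 GPUs).
print(scaled_lr(128, 8))  # 0.001, i.e. lr 1e-3
```

Both settings scale the base lr of 5e-4 by the same per-sample factor, which is why the comparison is consistent despite the different absolute numbers.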
Hi, I have noticed that the hyper-parameter configuration used in the code is inconsistent with the arXiv report.
I'm wondering whether this inconsistency makes the comparison unfair.