In the paper, the pretraining hyperparameters are specified as 90 epochs with 5k warmup steps. However, the code uses 500 warmup steps and about 65 epochs on ImageNet-1M -- the latter gives a more reasonable warmup-to-total-epochs ratio at batch size 4096. It is possible that the paper's numbers are only for finetuning and not pretraining. In any case, could you clarify which hyperparameters were used to pretrain the released checkpoints? TIA
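For context, here is a quick back-of-envelope sketch of the ratio I mean. It assumes the standard ImageNet-1k training split of ~1,281,167 images; the exact step counts will shift slightly depending on drop-last behavior, but the contrast between the two settings holds either way:

```python
# Back-of-envelope: what fraction of training is spent in warmup?
# Assumes ~1,281,167 ImageNet-1k training images (standard split).
IMAGES = 1_281_167
BATCH = 4096
steps_per_epoch = -(-IMAGES // BATCH)  # ceil division, ~313 steps/epoch

def warmup_fraction(warmup_steps: int, epochs: int) -> float:
    """Warmup steps as a fraction of total training steps."""
    return warmup_steps / (steps_per_epoch * epochs)

paper = warmup_fraction(5_000, 90)  # paper: 5k warmup, 90 epochs
code = warmup_fraction(500, 65)     # code: 500 warmup, ~65 epochs
print(f"paper: {paper:.1%}, code: {code:.1%}")
```

The paper's setting spends roughly 18% of all steps in warmup, while the code's setting spends about 2.5%, which is much closer to typical large-batch recipes.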