justHungryMan opened this issue 2 years ago
Hi Sangjun
A couple of differences:
Final validation top-1 accuracy was 74.79% and the training curves looked as follows:
HTH, Andreas
@andsteing Thank you for your answer
I'll be back in 3 days, after training finishes... :)
Hi @andsteing. Thanks to you, I was able to do the experiment well. For users who want to train from scratch, I opened repo (https://github.com/justHungryMan/vision-transformer-tf). (Please let me know if this is a problem.)
I've discovered a few things while doing the experiment, and I'd like to hear your opinions.
Higher val accuracy upstream does not necessarily lead to higher accuracy downstream. When I trained B/16 with lr=0.003, it reached 74.37% upstream (ImageNet-1k, bs=1024; call it A), while lr=0.00075 reached 74.52% upstream (call it B). But A reached 74.896% downstream while B reached only 72.467%. It is interesting, but hard to understand why.
When fine-tuning, the public code only uses "resize" on the val set, but resizing to 416 and then center-cropping to 384 gives higher accuracy (74.896% -> 75.945%). May I ask why you use plain "resize" on the val set?
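For reference, here is a minimal, framework-agnostic sketch of the resize-then-center-crop eval pipeline described above. The `center_crop` helper and the 416/384 sizes mirror the numbers in this thread; the resize step itself is a placeholder (in practice you would use your framework's resize op, e.g. a bilinear resize to 416x416, before cropping).

```python
import numpy as np

def center_crop(image, crop_size):
    """Take a crop_size x crop_size patch from the center of an HWC image."""
    h, w = image.shape[:2]
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    return image[top:top + crop_size, left:left + crop_size]

# Stand-in for a validation image already resized to 416x416
# (the actual resize would be done with the framework's resize op).
resized = np.zeros((416, 416, 3), dtype=np.float32)
cropped = center_crop(resized, 384)
print(cropped.shape)  # (384, 384, 3)
```

The idea behind the accuracy gain is standard: resizing to a slightly larger size and cropping the center keeps the object at a scale closer to what the model saw during training, instead of squashing the full frame directly to 384x384.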
Hi, @andsteing. I'm trying to train ViT-B/16 from scratch on ImageNet using the SAM optimizer. Could you share your training details? From the paper, I believe it uses the same base optimizer with rho=0.2. However, I can't get anywhere near 79.9%.
@xiangning-chen who added SAM checkpoints in #119
Hi @yzlnew, how many machines are you using to train the model? This essentially determines the distributed level of SAM, which corresponds to the m-sharpness discussed in section 4.1 here. For my experiments, I used 64 TPU chips. If you are using fewer machines, my experience is to enlarge rho.
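To make the rho discussion concrete, here is a toy NumPy sketch of a single SAM update (not the actual repo code): take the gradient at the current weights, step distance rho along its normalized direction, then descend using the gradient computed at that perturbed point. In the distributed setting mentioned above, each accelerator computes this perturbation from its own batch shard, so fewer machines means a larger per-shard batch (larger m in m-sharpness), which is why a bigger rho can help compensate.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.2):
    """One SAM update: perturb weights to the adversarial point
    w + rho * g / ||g||, then apply the gradient taken there."""
    g = grad_fn(w)                                # gradient at current weights
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # scaled ascent direction
    g_adv = grad_fn(w + eps)                      # gradient at perturbed weights
    return w - lr * g_adv                         # sharpness-aware descent step

# Toy quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
w_new = sam_step(w, lambda x: x, lr=0.1, rho=0.2)
```

This is only an illustration of the two-step update; the real implementation syncs the perturbed gradients across devices before applying the optimizer.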
@xiangning-chen Thanks for the clarification! I have tried training on 4/8/32 A100 cards with rho=0.2. I have also noticed that a larger rho can improve performance in other experiments.
Thanks for your work and your detailed answers in the issues. I am reproducing ViT-B/16 in TensorFlow based on your paper and answers. (In this issue, I only deal with the original ViT paper.) But I only reached about 47% on the ImageNet-1k validation set (upstream). I want to know whether my experimental conditions are incorrect, and I hope this issue helps others reproduce ViT from scratch.
Here are my training curves (train loss, val loss, train acc, val acc, and LR):
I don't know if you have time to look at the code, but here is my code.