yyou1996 opened this issue 3 years ago
Note that the published checkpoints were pre-trained on ImageNet-21k. Pre-training accuracy on the validation set at the end of pre-training was:
name | val_prec_1 |
---|---|
ViT-B_16 | 47.88% |
ViT-B_32 | 44.04% |
ViT-L_16 | 49.90% |
ViT-L_32 | 45.42% |
ViT-H_14 | 49.06% |
@andsteing hello, do you by any chance have the pre-training accuracy on imagenet1k? I mean pre-training from scratch, not fine-tuning. I reached 48.4% with ViT-B_16 on the imagenet1k validation set and would like a reference number if you have one.
Sure, results after 300 ep (edit: L/32 and L/16 were trained for 90 ep) training on i1k from scratch are below:
name | val_prec_1 |
---|---|
ViT-B/32 i1k | 69.19% |
ViT-B/16 i1k | 74.79% |
ViT-L/32 i1k | 66.90% |
ViT-L/16 i1k | 72.59% |
Thanks. I trained for far fewer epochs; will try again.
For training ViT from scratch you'll find that data augmentation and model regularization really help with medium-sized datasets (such as ImageNet and ImageNet-21k), though they require even longer training schedules (1000 ep for ImageNet, and 300 ep for ImageNet-21k).
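As a rough illustration of the kind of augmentation/regularization meant here, below is a minimal Mixup sketch in JAX. This is not the repo's actual input pipeline, and `alpha=0.2` is just an assumed value.

```python
# Minimal Mixup sketch (Zhang et al., 2018) -- illustrative only, not the
# repository's input pipeline; alpha=0.2 is an assumed value.
import jax
import jax.numpy as jnp


def mixup(rng, images, labels, alpha=0.2):
  """Mixes a batch with a shuffled copy of itself, producing soft labels."""
  beta_rng, perm_rng = jax.random.split(rng)
  lam = jax.random.beta(beta_rng, alpha, alpha)        # mixing coefficient
  perm = jax.random.permutation(perm_rng, images.shape[0])
  mixed_images = lam * images + (1.0 - lam) * images[perm]
  mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
  return mixed_images, mixed_labels


# Usage on a dummy batch:
rng = jax.random.PRNGKey(0)
x = jax.random.normal(rng, (8, 224, 224, 3))            # images
y = jax.nn.one_hot(jnp.arange(8) % 1000, 1000)          # one-hot labels
mx, my = mixup(rng, x, y)
```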
thanks for info.
@andsteing What loss function do you use for pre-training? Is it the same as for fine-tuning? I have seen people use a semantic loss; is that necessary to reach SOTA?
We used sigmoid crossentropy during pre-training (and we're using softmax crossentropy for fine-tuning).
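As a concrete illustration, here is a minimal sketch of the two objectives in plain JAX (this is a shorthand version, not the repo's exact loss code):

```python
# Sketch of softmax vs. sigmoid cross-entropy on (multi-)hot label vectors.
# Illustrative only -- not the repository's exact loss implementation.
import jax.numpy as jnp
from jax.nn import log_sigmoid, log_softmax


def softmax_xent(logits, labels):
  """Single-label cross-entropy: one softmax normalizes over all classes."""
  return -jnp.sum(labels * log_softmax(logits, axis=-1), axis=-1)


def sigmoid_xent(logits, labels):
  """Per-class binary cross-entropy: each class is an independent sigmoid."""
  log_p = log_sigmoid(logits)
  log_not_p = log_sigmoid(-logits)   # log(1 - sigmoid(x)) == log_sigmoid(-x)
  return -jnp.sum(labels * log_p + (1.0 - labels) * log_not_p, axis=-1)
```

With one-hot labels both are valid classification objectives; the sigmoid version just treats every class as an independent binary decision, so its absolute value and its curve over training are not directly comparable to the softmax loss.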
Hi @andsteing,
AFAIK we normally use sigmoid CE loss for multi-label tasks, since the labels are assumed to be independent, and softmax CE loss for single-label tasks, since we are looking for the max class. (The two are actually equivalent for binary classification.)
Pre-training is not a multi-label task, so why do you use sigmoid CE loss?
I also found that with softmax CE loss the loss curve decreases quickly, but with sigmoid CE loss it barely decreases. Is that normal?
We experimented with both softmax CE and sigmoid CE, and found that sigmoid CE works better even with single-label i1k; see also the "Are we done with ImageNet?" paper for similar results.
As for the training loss, we observed the evolution shown in the attached plot.
Thanks for replying
@andsteing are the accuracies you mentioned obtained at 224×224 resolution?
@andsteing can you please confirm that these pre-training accuracies were obtained at 224×224 resolution?
(just came back from holiday)
Yes, the i1k pre-training accuracies above are indeed for 224×224 resolution. We only changed the resolution for fine-tuning runs.
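For context, changing the resolution at fine-tuning time changes the number of patches, so the pre-trained position-embedding grid has to be interpolated to the new grid size. A minimal sketch of that idea (illustrative only, not the repo's exact checkpoint-loading code):

```python
# Sketch: bilinearly resize a learned ViT position-embedding grid when the
# fine-tuning resolution (and thus the patch grid) changes. Illustrative only.
import jax
import jax.numpy as jnp


def resize_posemb(posemb, new_grid):
  """posemb: [1, 1 + old_grid**2, dim], with a leading class-token embedding."""
  cls_tok, grid_tok = posemb[:, :1], posemb[:, 1:]
  old_grid = int(grid_tok.shape[1] ** 0.5)
  dim = grid_tok.shape[-1]
  grid_tok = grid_tok.reshape(old_grid, old_grid, dim)
  grid_tok = jax.image.resize(grid_tok, (new_grid, new_grid, dim), method='bilinear')
  return jnp.concatenate(
      [cls_tok, grid_tok.reshape(1, new_grid * new_grid, dim)], axis=1)


# e.g. B/16 pre-trained at 224 (14x14 patches), fine-tuned at 384 (24x24 patches):
posemb_384 = resize_posemb(jnp.zeros((1, 1 + 14 * 14, 768)), 24)  # (1, 577, 768)
```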
@andsteing Thanks for your help. I finally reached 75.5% validation accuracy when pre-training on i1k with B/16, even without some of the tricks mentioned in your papers, like stochastic depth (I do not use it), a linear schedule (I used cosine), Adam (I used SGD), and grad-norm clipping (I do not use it). I just wonder if there is an official statement of the accuracies you mentioned above. How should I cite your work properly?
@andsteing in the paper "How to train your ViT", Figure 4, left plot, ViT on ImageNet-1k trained for 300 epochs reaches 83%. Your comment above does not match these numbers. I might have missed something; can you please clarify?
> Sure, results after 300 ep (edit: L/32 and L/16 were trained for 90 ep) training on i1k from scratch are below:
>
> name | val_prec_1 |
> ---|---|
> ViT-B/32 i1k | 69.19% |
> ViT-B/16 i1k | 74.79% |
> ViT-L/32 i1k | 66.90% |
> ViT-L/16 i1k | 72.59% |
This is the result that I am referring to. Looking forward to your reply.
Hi @cissoidx
This thread started on January 30th 2021 and is about the i1k from-scratch training in the original ViT paper. The "How to train your ViT" paper applies additional AugReg to improve those numbers, but it was only published in June 2021, so I thought it did not apply to the original question (and mixing numbers from two different papers could make the thread more confusing).
You can find all the data about the pre-training and fine-tuning of "How to train your ViT" in the Colab: https://colab.research.google.com/github/google-research/vision_transformer/blob/master/vit_jax_augreg.ipynb
Best, Andreas
Hi @cissoidx, I'm training ViT-B/16 from scratch on imagenet1k now. I only get 47.6% validation accuracy, and you also got 48.4% at first (https://github.com/google-research/vision_transformer/issues/62#issuecomment-888779431). Can you tell me how you improved the accuracy? All my hyperparameters are the same as in the ViT paper.
@justHungryMan I guess it is not possible to reach the paper's SOTA with the default hyperparameters. Since they did not release the training code, you have to tune the hyperparameters yourself. Some suggestions: use ImageNet augmentation (as provided by the RandAugment package) and weight decay = 0.004.
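As a rough sketch of what such a from-scratch setup could look like with optax (the concrete values below for batch size, learning rate, and warmup are assumptions, not the authors' recipe):

```python
# Rough optax sketch of the recipe discussed above: SGD + momentum, cosine
# schedule, decoupled weight decay of 0.004. All concrete numbers are assumed.
import optax

steps_per_epoch = 1281167 // 4096           # ImageNet-1k train size, assumed batch 4096
total_steps = 300 * steps_per_epoch         # 300 epochs, as in the thread above

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-3,                        # assumed peak learning rate
    warmup_steps=10_000,                    # assumed warmup length
    decay_steps=total_steps,
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),         # optional; some recipes skip grad clipping
    optax.add_decayed_weights(4e-3),        # weight decay = 0.004, as suggested above
    optax.sgd(learning_rate=schedule, momentum=0.9),
)
```

The augmentation side (RandAugment / Mixup, as sketched earlier in the thread) would be applied in the input pipeline, independently of the optimizer.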
See also discussion on #153
Thanks for your excellent work. Would you mind me asking what the pre-training accuracy on imagenet2012 was for the checkpoints that are then used for fine-tuning?