google-research / vision_transformer


Inquiry on pretraining acc #62

Open yyou1996 opened 3 years ago

yyou1996 commented 3 years ago

Thanks for your excellent work. Would you mind me asking what the pre-training accuracy on ImageNet-2012 was for the checkpoints that were then used for fine-tuning?

andsteing commented 3 years ago

Note that the published checkpoints were pre-trained on ImageNet-21k. Pre-training accuracy on the validation set at the end of pre-training was:

| name | val_prec_1 |
| --- | --- |
| ViT-B_16 | 47.88% |
| ViT-B_32 | 44.04% |
| ViT-L_16 | 49.90% |
| ViT-L_32 | 45.42% |
| ViT-H_14 | 49.06% |

cissoidx commented 3 years ago

@andsteing hello, do you have, by any chance, the accuracy of pre-training on ImageNet-1k? I mean pre-training from scratch, not fine-tuning. I reached 48.4% with ViT-B_16 on the ImageNet-1k validation set and would like a reference number if you have one.

andsteing commented 3 years ago

Sure, results after 300 epochs of training on i1k from scratch (edit: L/32 and L/16 were trained for 90 epochs) are below:

| name | val_prec_1 |
| --- | --- |
| ViT-B/32 i1k | 69.19% |
| ViT-B/16 i1k | 74.79% |
| ViT-L/32 i1k | 66.90% |
| ViT-L/16 i1k | 72.59% |

cissoidx commented 3 years ago

Thanks. I trained for far fewer epochs; will try again.

andsteing commented 3 years ago

For training ViT from scratch, you'll find that data augmentation and model regularization really help with medium-sized datasets (such as ImageNet and ImageNet-21k), though they require even longer training schedules (1,000 epochs for ImageNet and 300 epochs for ImageNet-21k).
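For illustration, a minimal sketch of what such an augmentation/regularization setup could look like as an `ml_collections` config; all field names and values below are hypothetical placeholders, not the repo's actual config schema:

```python
# Hypothetical AugReg-style training config (illustrative field names and values only).
import ml_collections


def get_augreg_config() -> ml_collections.ConfigDict:
  config = ml_collections.ConfigDict()
  config.dataset = 'imagenet2012'
  config.total_epochs = 1000          # ~300 for ImageNet-21k
  # Data augmentation.
  config.randaugment_layers = 2
  config.randaugment_magnitude = 10
  config.mixup_alpha = 0.2
  # Model regularization.
  config.dropout_rate = 0.1
  config.stochastic_depth = 0.1
  config.weight_decay = 0.03
  return config
```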

cissoidx commented 3 years ago

Thanks for the info.

cissoidx commented 3 years ago

@andsteing What loss function do you use for pre-training? Is it the same as the one you use for fine-tuning? I have seen people use a semantic loss; is it necessary to reach SOTA?

andsteing commented 3 years ago

We used sigmoid cross-entropy during pre-training (and we're using softmax cross-entropy for fine-tuning).
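As a rough sketch (not necessarily the repo's actual code), the two losses could be written with `optax` like this, where `logits` has shape `[batch, num_classes]` and the labels are one-hot:

```python
import jax.numpy as jnp
import optax


def pretrain_loss(logits, one_hot_labels):
  # Sigmoid cross-entropy: every class is scored independently.
  per_class = optax.sigmoid_binary_cross_entropy(logits, one_hot_labels)
  return jnp.mean(jnp.sum(per_class, axis=-1))


def finetune_loss(logits, one_hot_labels):
  # Softmax cross-entropy: classes compete through the softmax normalizer.
  return jnp.mean(optax.softmax_cross_entropy(logits, one_hot_labels))
```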

cissoidx commented 3 years ago

Hi @andsteing,

AFAIK, we normally use sigmoid CE loss for multi-label tasks, since we assume the labels are independent, and softmax CE loss for single-label tasks, since we are looking for the max class. The two are actually equivalent for binary classification.
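To make the distinction concrete with a toy example (arbitrary values, just for illustration): sigmoid turns each logit into an independent probability, while softmax normalizes across all classes.

```python
import jax.numpy as jnp
from jax import nn

logits = jnp.array([2.0, 2.0, -1.0])   # three-class toy logits
print(nn.sigmoid(logits))              # ~[0.88, 0.88, 0.27]: no sum-to-1 constraint
print(nn.softmax(logits))              # ~[0.49, 0.49, 0.02]: sums to 1
```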

Pre-training is not a multi-label task, so why do you use sigmoid CE loss?

cissoidx commented 3 years ago

I found that if I use softmax CE loss, the loss curve decreases quickly, but if I use sigmoid CE loss, the loss curve barely decreases. Is this normal?

andsteing commented 3 years ago

We experimented with both softmax CE and sigmoid CE, and found that sigmoid CE works better even on single-label i1k; see also the "Are we done with ImageNet?" paper for similar results.

As for the training loss, we observed the following evolution:

[screenshot: training-loss curves]

cissoidx commented 3 years ago

Thanks for replying

cissoidx commented 3 years ago

@andsteing Are the accuracies you mentioned obtained at 224×224 resolution?

cissoidx commented 3 years ago

@andsteing Can you please confirm that these pre-training accuracies were obtained at 224×224 resolution?

andsteing commented 3 years ago

(just came back from holiday)

Yes, the i1k pre-training accuracies above are indeed for 224×224 resolution. We only changed the resolution for fine-tuning runs.
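The reason the resolution can change at fine-tuning time is that the learned position embeddings are interpolated to the new token grid when a checkpoint is loaded. A minimal sketch of that idea, assuming `jax.image.resize` for the interpolation (the repo has its own implementation; this is only an illustration):

```python
import jax
import jax.numpy as jnp


def resize_posemb(posemb, old_grid, new_grid):
  """Resizes [1, 1 + old_grid**2, D] position embeddings to a new token grid."""
  cls_tok, grid = posemb[:, :1], posemb[:, 1:]     # keep the class-token embedding
  d = grid.shape[-1]
  grid = grid.reshape(1, old_grid, old_grid, d)
  grid = jax.image.resize(grid, (1, new_grid, new_grid, d), method='bilinear')
  return jnp.concatenate([cls_tok, grid.reshape(1, new_grid * new_grid, d)], axis=1)


posemb_224 = jnp.zeros((1, 1 + 14 * 14, 768))      # dummy B/16 embeddings at 224x224
posemb_384 = resize_posemb(posemb_224, 14, 24)     # 384 / 16 = 24 tokens per side
```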

cissoidx commented 3 years ago

@andsteing Thanks for your help. I finally reached 75.5% validation accuracy pre-training on i1k with B/16, even without some of the tricks mentioned in your papers, such as stochastic depth (I do not use it), a linear schedule (I used cosine), Adam (I used SGD), and gradient norm clipping (I do not use it). I just wonder whether there is an official statement of the accuracies you mentioned above. How should I cite your work properly?
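For reference, a hedged sketch of the kind of setup described above (SGD with momentum plus a warmup/cosine learning-rate schedule) written with `optax`; the concrete values are placeholders, not the settings that produced the 75.5%:

```python
import optax

# Placeholder warmup + cosine decay schedule.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=0.1,        # placeholder peak learning rate
    warmup_steps=10_000,
    decay_steps=300_000,   # placeholder total number of training steps
)
optimizer = optax.sgd(learning_rate=schedule, momentum=0.9)
```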

cissoidx commented 3 years ago

@andsteing In the paper "How to train your ViT", Figure 4, left plot, ViT on ImageNet-1k at 300 epochs reaches 83%. Your comment above does not match these numbers. I might have missed something; can you please clarify?

> Sure, results after 300 epochs of training on i1k from scratch (edit: L/32 and L/16 were trained for 90 epochs) are below:
>
> | name | val_prec_1 |
> | --- | --- |
> | ViT-B/32 i1k | 69.19% |
> | ViT-B/16 i1k | 74.79% |
> | ViT-L/32 i1k | 66.90% |
> | ViT-L/16 i1k | 72.59% |

cissoidx commented 3 years ago
[screenshot: Figure 4 (left) from "How to train your ViT"]

This is the result I am referring to. Looking forward to your reply.

andsteing commented 3 years ago

Hi @cissoidx

This thread started on January 30th, 2021 and is about the i1k from-scratch training in the original ViT paper. The "How to train your ViT" paper applies additional AugReg to improve those numbers, but it was published only in June 2021, so I did not think it applied to the original question (and I thought mixing numbers from two different papers could make the thread more confusing).

You can find all the data about the pre-training and fine-tuning from "How to train your ViT" in the Colab: https://colab.research.google.com/github/google-research/vision_transformer/blob/master/vit_jax_augreg.ipynb

Best, Andreas

justHungryMan commented 3 years ago

Hi @cissoidx, I'm training ViT-B/16 from scratch on ImageNet-1k now. I only get 47.6% validation accuracy, and you also got 48.4% (https://github.com/google-research/vision_transformer/issues/62#issuecomment-888779431). Can you tell me how you improved the accuracy? All my parameters are the same as in the ViT paper.

cissoidx commented 3 years ago

@justHungryMan I guess it is not possible to reach the paper's SOTA with the default hyperparameters. Since they do not release the training code, you have to tune the hyperparameters yourself. Some suggestions: use the ImageNet augmentation policy (as proposed with RandAugment) and weight decay = 0.004; see the sketch below.
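A hedged sketch of that kind of input pipeline, using torchvision's RandAugment implementation as one concrete option (the specific crop size and magnitudes are placeholders; the suggested 0.004 weight decay goes into the optimizer, not the transform):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # placeholder RandAugment settings
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
# Combine with weight_decay=0.004 in the optimizer, as suggested above.
```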

andsteing commented 3 years ago

See also the discussion in #153.