google-research / vision_transformer


train ViT-b16 from scratch on Imagenet #153

Open justHungryMan opened 2 years ago

justHungryMan commented 2 years ago

Thanks for your work and the detailed answers in the issues. I am reproducing ViT-B/16 in TensorFlow based on your paper and answers. (In this issue, I only deal with the original ViT paper.) But I only reached about 47% on the ImageNet-1k validation set (upstream). I want to know whether my experimental conditions are incorrect, and I hope this issue helps others reproduce ViT from scratch.

Here are my train loss, val loss, train acc, val acc, and learning-rate curves:

[figures: train_loss, val_loss, train_acc, val_acc, lr]

I don't know if you have time to look at the code, but here is my code.

andsteing commented 2 years ago

Hi Sangjun,

A couple of differences:

  1. Even with a smaller batch size, I would keep the same lr=3e-3 (when using Adam).
  2. Speaking of Adam, we used the following parameters: beta1=0.9, beta2=0.999.
  3. We used float32 throughout for pre-training.
  4. For evaluation, we first resized to 256px (smaller side, keeping the original aspect ratio), then took a 224px central crop (see the sketch after this list).
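
For concreteness, here is a minimal TensorFlow sketch of points 1, 2, and 4 (an illustration, not the exact training code; `eval_preprocess` is a made-up name):

```python
import tensorflow as tf

# Points 1-2: Adam with lr=3e-3, beta1=0.9, beta2=0.999.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=3e-3, beta_1=0.9, beta_2=0.999)

# Point 4: resize the smaller side to 256px (keeping the original
# aspect ratio), then take a 224px central crop.
def eval_preprocess(image, resize_small=256, crop_size=224):
    h = tf.cast(tf.shape(image)[0], tf.float32)
    w = tf.cast(tf.shape(image)[1], tf.float32)
    ratio = resize_small / tf.minimum(h, w)
    new_h = tf.cast(tf.round(h * ratio), tf.int32)
    new_w = tf.cast(tf.round(w * ratio), tf.int32)
    image = tf.image.resize(image, [new_h, new_w])
    top = (new_h - crop_size) // 2
    left = (new_w - crop_size) // 2
    return tf.image.crop_to_bounding_box(
        image, top, left, crop_size, crop_size)
```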

Final validation top-1 accuracy was 74.79%, and the training curves looked as follows: [figure]

HTH, Andreas

justHungryMan commented 2 years ago

@andsteing Thank you for your answer

  1. Why keep the same lr when the batch size is different?
  2. I know you are using TPUs, but is there a reason why you used float32 instead of bfloat16? (Accuracy in the attention?)
  3. OK, so: the same Inception crop for the train set, but for the val set, resize to 256 on the smaller side (for example, 768x512 -> 384x256) and then a 224 center crop.

I'll be back after 3 days of training... :)

andsteing commented 2 years ago

  1. When pre-training vision transformers, we found that the optimal learning rate was fairly stable across batch sizes. You could try different learning rates to verify, but I fear there's no simple relationship between learning rate and batch size when pre-training vision transformers (see e.g. http://arxiv.org/abs/1811.03600 for an empirical study of this subject).
  2. In general, going from full to half precision lowers quality and can lead to instabilities. We would usually do everything in float32, and then later try running individual parts of the parameters and/or optimizer in bfloat16 (see the sketch after this list). But lowering the precision while keeping the quality constant generally requires some trial and error, and it's not obvious what will work and what won't.
  3. That's correct.
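
As a minimal illustration of point 2 (not the exact setup we used): Keras mixed precision keeps the variables in float32 while running most compute in bfloat16, and lets you pin numerically sensitive parts back to float32.

```python
import tensorflow as tf

# Run most compute in bfloat16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation='gelu'),
    # Keep numerically sensitive layers (e.g. the final logits)
    # explicitly in float32.
    tf.keras.layers.Dense(1000, dtype='float32'),
])
```
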
justHungryMan commented 2 years ago

Hi @andsteing. Thanks to you, I was able to run the experiments well. For users who want to train from scratch, I opened a repo (https://github.com/justHungryMan/vision-transformer-tf). (Please let me know if this is a problem.)

I've discovered a few things while doing the experiment, and I'd like to hear your opinions.

  1. Higher val accuracy upstream does not necessarily lead to higher accuracy downstream. When I trained B/16 with lr=0.003, it reached 74.37% upstream (ImageNet-1k, bs=1024; call it A), and lr=0.00075 reached 74.52% upstream (call it B), but A reached 74.896% downstream while B reached 72.467%. It is interesting but hard to understand why.

  2. When fine-tuning, the public code only uses "resize" for the val set, but resizing to 416 and then center-cropping to 384 gives higher accuracy (74.896% -> 75.945%; see the snippet after this item). May I ask why you use "resize" for the val set?
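
To make the comparison concrete (variable names are mine), reusing the `eval_preprocess` sketch from earlier in this thread:

```python
import tensorflow as tf

image = tf.random.uniform([512, 768, 3])  # dummy HxWx3 image

# What the public fine-tuning code does: plain resize to 384x384.
plain = tf.image.resize(image, [384, 384])

# The variant above: resize the smaller side to 416, then take a
# 384px central crop.
cropped = eval_preprocess(image, resize_small=416, crop_size=384)
```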

yzlnew commented 1 year ago

Hi @andsteing. I'm trying to train ViT-B/16 from scratch on ImageNet using the SAM optimizer. Could you share your training details? From the paper, I believe it uses the same base optimizer with rho=0.2. However, I can't get anywhere near 79.9%.

andsteing commented 1 year ago

cc @xiangning-chen, who added the SAM checkpoints in #119

xiangning-chen commented 1 year ago

Hi @yzlnew, how many machines are you using to train the model? This essentially determines the degree to which SAM is distributed, which corresponds to the m-sharpness discussed in Section 4.1 here. For my experiments, I used 64 TPU chips. If you are using fewer machines, my experience is to enlarge rho.
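
For reference, here is a minimal sketch of one SAM update step in TensorFlow (my own illustration, not the exact implementation; in a data-parallel setup the perturbation is computed per device, which is where m-sharpness enters):

```python
import tensorflow as tf

def sam_step(model, images, labels, loss_fn, base_optimizer, rho=0.2):
    # First pass: gradient at the current weights.
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)

    # Perturb weights by rho * g / ||g|| (global gradient norm).
    grad_norm = tf.linalg.global_norm(grads) + 1e-12
    eps = [g * (rho / grad_norm) for g in grads]
    for v, e in zip(model.trainable_variables, eps):
        v.assign_add(e)

    # Second pass: gradient at the perturbed weights.
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    sam_grads = tape.gradient(loss, model.trainable_variables)

    # Undo the perturbation, then apply the base optimizer update.
    for v, e in zip(model.trainable_variables, eps):
        v.assign_sub(e)
    base_optimizer.apply_gradients(
        zip(sam_grads, model.trainable_variables))
    return loss
```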

yzlnew commented 1 year ago

@xiangning-chen Thanks for the clarification! I have tried training on 4/8/32 A100 cards with rho=0.2, and I also noticed that a larger rho can improve performance in other experiments.