google-research / vision_transformer

Apache License 2.0
10.58k stars 1.31k forks

Fine-tune VIT-B-16 on imagenet2012 #85

Open lucasliunju opened 3 years ago

lucasliunju commented 3 years ago

Hi,

I tried to fine-tune the ViT-B/16 model on imagenet2012 on a TPU v3-8. The top-1 accuracy is 84.1% (different from the reported 84.6%). I would like to ask whether I need to change the default hyper-parameters for this experiment.

Thank you!

andsteing commented 3 years ago

Can you double-check that you used exactly the same parameters as in the runs from the README? https://tensorboard.dev/experiment/vNVL9RFmTBKJ4uK81CbGMQ/#scalars&_smoothingWeight=0&regexInput=imagenet21k%2FViT-B_16%2Fimagenet2012%2F

Those fine-tunings over 20k steps on 8x V100 GPUs took ~18 hours and ended with 84.61% and 84.62% final accuracy.

lucasliunju commented 3 years ago

Hi, I can find the learning rate and warmup in the README. I would like to ask about the hyper-parameters for this experiment (https://tensorboard.dev/experiment/vNVL9RFmTBKJ4uK81CbGMQ/#scalars&_smoothingWeight=0&regexInput=imagenet21k%2FViT-B_16%2Fimagenet2012%2F), such as the parameters in flags.py.

Currently, I am using the default parameters (in flags.py) to train the model, and I think those parameters were designed for CIFAR-10.

So I would like to ask for the parameters behind this experiment result (https://tensorboard.dev/experiment/vNVL9RFmTBKJ4uK81CbGMQ/#scalars&_smoothingWeight=0&regexInput=imagenet21k%2FViT-B_16%2Fimagenet2012%2F).

Thank you!

andsteing commented 3 years ago

It was trained with the default parameters; you can verify this in the hparams tab of the tensorboard.dev link above.

So training this for 20k steps on 8x TPUv2 should give you identical results. Can you share the full training metrics for comparison?

lucasliunju commented 3 years ago

I think maybe that's because I am using a TPU, not a GPU?

lucasliunju commented 3 years ago

Hi, I just changed the dataset from CIFAR-10 to imagenet2012 and haven't changed anything else in the code. My training log is here: https://docs.google.com/document/d/1uWwylLuNi_aQsYaovNCuM3fKVuPRVIuMXFShM5foDRg/edit?usp=sharing

Comparing with the results at https://tensorboard.dev/experiment/vNVL9RFmTBKJ4uK81CbGMQ/#scalars&_smoothingWeight=0&regexInput=imagenet21k%2FViT-B_16%2Fimagenet2012%2F, I find the test accuracy shows a gap starting from step 2000.

andsteing commented 3 years ago

That's unexpected.

We produced our original results on TPUs, but I then only tested the open-sourced code on GPUs (links from the README). Let me rerun the code on TPUs to see if I can reproduce your results first.

lucasliunju commented 3 years ago

Thanks so much!

lucasliunju commented 3 years ago

Hi, I would like to ask about the updated result from your TPU run. I find there is also a gap in the CIFAR-10 result.

Thank you!

Yong

andsteing commented 3 years ago

I can confirm I also got a similar 84.13% final accuracy, as you already reported, on 8x TPUv2. I am now rerunning with some changed configs on both TPU and GPU to verify these results and try to understand what could cause the difference. I will update here when results are available.

lucasliunju commented 3 years ago

Thank you. Have a nice day!

andsteing commented 3 years ago

By accident, the original runs used dropout=0.0, which is why the results reported in the README improved over those reported in the paper (where we have 83.97% top-1 accuracy for B/16).

I added a comment to the top of the table but that got removed when the README was later updated with additional results. Fixed in dab0a5c.

I also checked that you get 84.63% when running on TPU with dropout=0.0 (and that the GPU run reaches 84.12% with dropout=0.1).

We're working on an updated release using the newer Flax Linen API and will regenerate the entire table for that purpose.

Dropout is set in the config file here:

https://github.com/google-research/vision_transformer/blob/dab0a5cf0ebede4f4474cc1a05c7623b9a34d6d3/vit_jax/configs.py#L44
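To make the fix concrete, disabling dropout amounts to zeroing that rate in the fine-tuning config before training. A minimal sketch (plain dict here for illustration; the actual repo builds an ml_collections.ConfigDict, and the exact field names are assumed from configs.py conventions):

```python
def get_finetune_config():
    """Hypothetical fine-tuning config sketch with dropout disabled.

    Field names mirror vit_jax/configs.py conventions but are assumed;
    the real code uses ml_collections.ConfigDict rather than a plain dict.
    """
    return {
        "model": "ViT-B_16",
        "dataset": "imagenet2012",
        "total_steps": 20_000,
        "transformer": {
            # The original README runs effectively used 0.0; the default
            # of 0.1 is what caused the ~0.5% top-1 accuracy gap.
            "dropout_rate": 0.0,
        },
    }
```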

lucasliunju commented 3 years ago

Thanks so much! I'll try it immediately. By the way, did you try using LARS as the optimizer?

Thank you!

Yong

andsteing commented 3 years ago

No, we didn't try the LARS optimizer. But that might be worthwhile.

lucasliunju commented 3 years ago

Thanks for your reply! It works now. I'll try to implement the LARS optimizer and report back when I've finished.
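For anyone following along, the LARS update rule (You et al., 2017) is straightforward to sketch: each layer's step is scaled by a "trust ratio" of its weight norm to its update norm. This is a hypothetical single-tensor illustration in plain Python, not the repo's optimizer API (in practice one would use something like optax's LARS implementation over the full parameter tree):

```python
import math

def lars_update(w, g, lr=0.1, weight_decay=1e-4, eps=1e-9):
    """One LARS step for a single parameter tensor (here: a flat list).

    LARS scales the layer-wise learning rate by the trust ratio
    ||w|| / ||g + wd*w||, so each layer takes a step proportional
    to its own weight norm.
    """
    update = [gi + weight_decay * wi for wi, gi in zip(w, g)]
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    # Fall back to a neutral ratio when either norm is zero.
    trust_ratio = w_norm / (u_norm + eps) if w_norm > 0 and u_norm > 0 else 1.0
    return [wi - lr * trust_ratio * ui for wi, ui in zip(w, update)]
```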

Yong

shairoz-deci commented 2 years ago

@lucasliunju Can you please share your fine-tuning code for ImageNet-1k? I am trying to fine-tune ViT-B/16 on ImageNet-1k from ImageNet-21k pre-training with an image size of 224 and can't reproduce the results (reaching an accuracy of 83.7% while the reported result is 84.4%). Specifically, can you mention the data augmentations used and perhaps any additional methods (EMA? weight averaging? head initialization?)

Thanks in advance