mlfoundations / wise-ft

Robust fine-tuning of zero-shot models
https://arxiv.org/abs/2109.01903

Poor performance on ResNet. #10

Closed jingzhengli closed 1 year ago

jingzhengli commented 1 year ago

Although fine-tuning the ViT models gives good performance, I found that performance on the ResNet models is poor. How should I fine-tune the CLIP model when using the pre-trained ResNet models? Thanks.

gabrielilharco commented 1 year ago

Hi @jingzhengli, could you give more details on your experimental setting and which results you are getting? Thanks!

jingzhengli commented 1 year ago

> Hi @jingzhengli, could you give more details on your experimental setting and which results you are getting? Thanks!

Hi, thanks for your quick reply and your nice work. I have two questions about fine-tuning CLIP.

The first question is how to fine-tune CLIP. Unlike the "end-to-end" and "linear classifier" baselines in your paper, I fully fine-tune (update both the vision and text encoders) the ViT-B/16-based CLIP on 11 public datasets with 16 shots, and I see a boost on all of them. However, if I replace the ViT-based vision encoder with ResNet-50 or ResNet-101, performance becomes worse than zero-shot CLIP. I also implemented the "linear classifier" setting to fine-tune the ResNet-based CLIP on ImageNet; the accuracy is 56%, compared to 60% for zero-shot CLIP.

The second question is about the implementation of fine-tuning CLIP. I run the experiments with my own implementation rather than your released code, and I would like to know how to set the initial learning rate. Thanks again for your consideration.

gabrielilharco commented 1 year ago

Hi @jingzhengli. It seems like there are quite a few experimental differences then, so it's hard to pinpoint what the issue might be. If I understood correctly, it's a bit odd that your linear classifier is giving lower accuracy than the corresponding zero-shot model. If you are initializing the head with the zero-shot weights, this is likely an issue with your hyper-parameters or a bug.
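For concreteness, this is roughly what initializing the head with the zero-shot weights can look like, as a minimal sketch assuming the OpenAI `clip` package; `classnames` and `template` below are hypothetical placeholders, not values from this repo:

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Load a ResNet-based CLIP model.
model, _ = clip.load("RN50")
model.eval()

# Hypothetical class names and prompt template.
classnames = ["dog", "cat"]
template = "a photo of a {}."

# Build zero-shot classification weights from the text encoder.
with torch.no_grad():
    texts = clip.tokenize([template.format(c) for c in classnames])
    text_features = model.encode_text(texts)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Initialize a linear head with the zero-shot weights. From this starting
# point, training should not drop below zero-shot accuracy, provided image
# features are also L2-normalized before being passed to the head.
head = torch.nn.Linear(text_features.shape[1], len(classnames), bias=False)
head.weight.data = text_features.float().clone()
```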

Re. learning rate, I'd recommend doing a sweep, since your experimental setting is different. Also note that weight interpolation (and thus WiSE-FT) can perform poorly if the learning rate is too large, so I'd recommend erring on the side of smaller learning rates if you can't do a proper hyper-parameter search.
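For reference, the interpolation itself is just an element-wise average of the two models' state dicts; a minimal sketch (the function name is illustrative; see the released code for the exact version used in the paper):

```python
def interpolate_weights(theta_0, theta_1, alpha):
    """Element-wise interpolation of two state dicts (weight-space ensembling).

    alpha = 0.0 returns the zero-shot weights, alpha = 1.0 the fine-tuned
    ones; both models must share the same architecture and parameter names.
    """
    assert set(theta_0.keys()) == set(theta_1.keys())
    return {k: (1 - alpha) * theta_0[k] + alpha * theta_1[k] for k in theta_0}

# Usage (zeroshot_model, finetuned_model, and alpha=0.5 are placeholders):
# model.load_state_dict(interpolate_weights(
#     zeroshot_model.state_dict(), finetuned_model.state_dict(), alpha=0.5))
```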