openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

Linear Probing ImageNet #400

Open · tudorcebere opened this issue 1 year ago

tudorcebere commented 1 year ago

Hello! Thank you for this excellent model & paper!

I am interested in reproducing the paper's linear probing results on ImageNet (using SGD). Can the authors provide some insight into how the results in Table 10 of the paper were obtained? My attempts using ViT-B/32 give significantly worse test performance; the probe seems to learn very poorly. I have followed the example.

Thank you!

jongwook commented 1 year ago

Hi, we used full-batch logistic regression fit with L-BFGS. For ImageNet, with 1M+ images in the training split, this was quite slow and required a huge amount of memory, especially considering the hyperparameter sweep over the L2 regularization term (C). Some discussion can be found in:
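
For reference, a minimal sketch of this kind of probe using scikit-learn's `LogisticRegression` with the `lbfgs` solver and a sweep over `C`. The random features below are hypothetical stand-ins; in practice `X_train`/`X_test` would be CLIP image-encoder outputs and the sweep would select `C` on a held-out validation split rather than the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic features standing in for CLIP embeddings:
# each class clusters around a random direction in feature space.
rng = np.random.default_rng(0)
n_train, n_test, dim, n_classes = 512, 128, 64, 4
class_means = rng.normal(size=(n_classes, dim))
y_train = rng.integers(0, n_classes, size=n_train)
y_test = rng.integers(0, n_classes, size=n_test)
X_train = class_means[y_train] + 0.5 * rng.normal(size=(n_train, dim))
X_test = class_means[y_test] + 0.5 * rng.normal(size=(n_test, dim))

# Sweep the inverse L2-regularization strength C, fitting a full-batch
# logistic regression probe with L-BFGS at each value.
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=1000)
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(f"best C = {best_C}, accuracy = {best_acc:.3f}")
```

At ImageNet scale this approach holds the entire feature matrix in memory and runs many full-batch L-BFGS fits, which is where the slowness and memory pressure mentioned above come from.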