mlfoundations / open_clip

An open source implementation of CLIP.

Finetuning CLIP ViT-B/32 on COCO captions #811

Open MLRadfys opened 5 months ago

MLRadfys commented 5 months ago

Hi and thank you so much for this awesome repository!

I am trying to finetune the OpenAI CLIP ViT-B/32 model on the COCO captions dataset and observe some major overfitting, though I am not really sure what the problem is. I basically took the standard finetuning parameters (batch size = 256, lr = 5e-6, warmup = 10000). Unfortunately, the loss curve looks like this:

[figure: training loss curve]
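For reference, my setup is roughly equivalent to this simplified sketch using the open_clip Python API (the COCO data loader and the weight decay value are placeholders, not my exact script; the reference training script additionally applies a cosine LR schedule with the warmup mentioned above):

```python
import torch
import open_clip
from open_clip.loss import ClipLoss

# Start from the pretrained OpenAI ViT-B/32 weights.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.cuda()

loss_fn = ClipLoss()
# lr and batch size as quoted above; the weight decay here is a placeholder.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=0.2)

def train_one_epoch(train_loader):
    """train_loader is assumed to yield (preprocessed images, caption strings) from COCO."""
    model.train()
    for images, captions in train_loader:
        images = images.cuda()
        texts = tokenizer(list(captions)).cuda()
        # The default open_clip forward returns (image_features, text_features, logit_scale).
        image_features, text_features, logit_scale = model(images, texts)
        loss = loss_fn(image_features, text_features, logit_scale)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```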

Another strange thing is the logit scale parameter: it decreases at the beginning of training and climbs back up to 100 again toward the end:

[figure: logit scale over training]
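As far as I can tell from the reference training code, the log-space scale is clamped after every step so that exp(logit_scale) never exceeds 100, which would explain why the curve plateaus exactly there. Roughly:

```python
import math
import torch

# After each optimizer step, the reference training loop clamps the log-space
# logit scale, so exp(logit_scale) is capped at 100.
with torch.no_grad():
    model.logit_scale.clamp_(0, math.log(100))

# What gets logged/plotted is the exponentiated value, hence the 100 ceiling.
print(f"logit scale: {model.logit_scale.exp().item():.2f}")
```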

But I also have to mention that several validation metrics improve. Before training (epoch 0) vs. after training (epoch 30), on 11829 validation samples:

| Metric | Before training | After training |
| --- | --- | --- |
| image_to_text_mean_rank | 40.5519 | 19.8642 |
| image_to_text_median_rank | 6.0000 | 3.0000 |
| image_to_text_R@1 | 0.2622 | 0.3605 |
| image_to_text_R@5 | 0.4770 | 0.6200 |
| image_to_text_R@10 | 0.5818 | 0.7266 |
| text_to_image_mean_rank | 56.3405 | 28.4263 |
| text_to_image_median_rank | 8.0000 | 3.0000 |
| text_to_image_R@1 | 0.2268 | 0.3553 |
| text_to_image_R@5 | 0.4399 | 0.6277 |
| text_to_image_R@10 | 0.5386 | 0.7265 |
| val_loss | 1.3907 | 1.1578 |
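For reference, these retrieval numbers can be reproduced from paired, L2-normalized image/text embeddings roughly like this (my rough sketch of what I believe the validation code computes, not the exact implementation):

```python
import torch

def retrieval_metrics(image_features, text_features, logit_scale=100.0):
    """Mean/median rank and R@K for row-aligned, L2-normalized features."""
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    n = logits_per_image.shape[0]
    ground_truth = torch.arange(n).view(-1, 1)

    metrics = {}
    for name, logits in (("image_to_text", logits_per_image),
                         ("text_to_image", logits_per_text)):
        # Position of the matching pair within each row's sorted similarities.
        ranking = torch.argsort(logits, descending=True)
        preds = torch.where(ranking == ground_truth)[1]
        metrics[f"{name}_mean_rank"] = preds.float().mean().item() + 1
        metrics[f"{name}_median_rank"] = preds.median().item() + 1
        for k in (1, 5, 10):
            metrics[f"{name}_R@{k}"] = (preds < k).float().mean().item()
    return metrics
```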

Any advice would be really appreciated :-)

Thanks in advance,

M

sean-xr commented 1 month ago

Hello, I might be facing the same issue, did you figure it out? Thanks for the help.

MLRadfys commented 1 month ago

Hi!

Yes, if I remember correctly I reduced both the learning rate and the batch size to 64 :-)
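Something along these lines; note that the learning rate below is just an illustrative placeholder, I don't remember the exact value anymore, only that it was lower than the 5e-6 I started with:

```python
import torch

# What ended up working better for me (sketch): smaller batches, lower LR.
# NOTE: lr is only a placeholder value, see the note above.
batch_size = 64
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.2)
```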

Cheers,

M