mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

Exact hyperparameters for NegCLIP training, and question about ImageNet accuracy reported in the paper #4

Closed HarmanDotpy closed 1 year ago

HarmanDotpy commented 1 year ago

Hi, I was finetuning CLIP-ViT-B-32 on the released COCO train dataset (which has hard negative images and texts), using the hyperparameters given in the paper, but I was experiencing a large drop in ImageNet accuracy (from 63% --> 33%). I was wondering if you could provide the exact hyperparameters for NegCLIP training; in particular, I am hoping that some details about the cosine scheduler and its parameters might be helpful. Also, I am keeping the weight decay at 0.0 right now; is this fine?

I am using the following hyperparameters right now. Cosine scheduler parameters:

LR_SCHEDULER_PARAMS:
  sched: cosine
  warmup_steps: 50
  warmup_lr: 0.000001
  min_lr: 0.000001

other hyperparameters as mentioned in the paper:

# learning rate after warmup:
START_LEARNING_RATE: 5e-6  # I have tried all three learning rates mentioned in the paper: 1e-5, 5e-6, and 1e-6

MAX_NUM_EPOCHS: 5
OPTIMIZER: AdamW
OPTIMIZER_PARAMS:
  weight_decay: 0.0

Another question I have is about the ImageNet accuracy reported in the paper, which is 75% (in Table 6 of the appendix). I am wondering what model is used here to get this accuracy; in particular, CLIP ViT-B/32 pretrained on 400M data reaches around 63% accuracy, as mentioned here: https://github.com/mlfoundations/open_clip/

vinid commented 1 year ago

Hello!

I am pretty sure we used the default parameters in OpenCLIP. You can also try the more recent configuration that gave us good results (it should also be in the README):

CUDA_VISIBLE_DEVICES=0 python -m training.main \
    --train-data="./mscoco_with_negatives_training.csv" \
    --batch-size=256 \
    --epochs=5 \
    --name="negclip_256_1e-6" \
    --lr=1e-6 \
    --val-data="./mscoco_with_negatives_valid.csv"  \
    --logs="./logs/negCLIP/" \
    --pretrained="openai" \
    --model="ViT-B-32"\
    --workers 14 \
    --warmup 50

I think the results described in https://github.com/mlfoundations/open_clip/ are on zero-shot classification, right? Table 6 shows the downstream performance on the image classification task (for CIFAR and ImageNet). The probe has the following structure:

import numpy as np
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self):
        super(LinearProbe, self).__init__()
        # Single linear layer on top of frozen CLIP features;
        # train_features/train_labels are the precomputed feature and label arrays.
        self.dense = nn.Linear(train_features.shape[1], np.max(train_labels) + 1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.dense(X)
        return self.softmax(X)
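
For context, a minimal sketch of how such a probe might be trained on frozen CLIP features (the data here is synthetic and the optimizer, learning rate, and epoch count are illustrative assumptions, not the paper's settings):

import numpy as np
import torch
import torch.nn as nn

# Synthetic stand-ins for precomputed CLIP embeddings and labels (shapes are illustrative).
train_features = np.random.randn(1000, 512).astype(np.float32)
train_labels = np.random.randint(0, 1000, size=1000)

probe = LinearProbe()
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)  # lr is an assumption
criterion = nn.NLLLoss()  # forward() already returns probabilities, so train on their log

X = torch.from_numpy(train_features)
y = torch.from_numpy(train_labels).long()
for epoch in range(10):  # epoch count is an assumption
    optimizer.zero_grad()
    loss = criterion(torch.log(probe(X) + 1e-12), y)
    loss.backward()
    optimizer.step()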

(Updated)

HarmanDotpy commented 1 year ago

I see, so in Table 6 the downstream performance means after linear probing, for all the datasets. That solves my problem, I think.

I will close this issue in some time, thanks for the prompt reply!

vinid commented 1 year ago

No worries!

(And thanks for the comment; it's probably better to say this explicitly in the paper.)

HarmanDotpy commented 1 year ago

I tested the released NegCLIP model on ImageNet and it gives 60% zero-shot accuracy, which is quite good, since it is only a 2-3 percent drop compared to the pretrained CLIP model. I am trying to replicate it with my code, which is different from OpenCLIP, and am looking at what I am doing wrong while finetuning (which results in 33% accuracy after finetuning).

Thanks for your help!

vinid commented 1 year ago

Hi!

  1. Not sure if this can help, but we saw some performance drop (in image classification tasks) with only text hard-negatives. Image hard-negatives helped us fix the problem. Apart from that, are the logit scaling and the various gradient clipping settings in your implementation the same as the ones in OpenCLIP (see the sketch after this list)? What about the batch size?

  2. Yes, that should be the model we uploaded online!

  3. No, we actually tried freezing the vision encoder but we didn't have much success with that
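
For reference, a rough sketch of the two OpenCLIP-style details mentioned in point 1 (logit-scale clamping and gradient clipping); the function names and values here are illustrative, not copied from either repo:

import math
import torch

def post_step_clamp(model):
    # After each optimizer step, CLIP-style training clamps the learnable
    # temperature so that exp(logit_scale) never exceeds 100.
    with torch.no_grad():
        model.logit_scale.clamp_(0, math.log(100))

def clip_gradients(model, max_norm=1.0):  # max_norm value is an assumption
    # Optional gradient clipping before optimizer.step(), if a max norm is configured.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm, norm_type=2.0)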

HarmanDotpy commented 1 year ago

I see, the suggestions are very helpful actually. I will do some hyperparameter search and make sure the logit scaling etc. are correct. About the batch size: in the paper it's written that a batch size of 1024 is used, while in the command above, 256 is used. I was wondering if you do gradient accumulation, since I am unable to fit 256 on one A100 GPU, but I am able to fit 128. I am using the NegCLIP fork of the OpenCLIP repo, which doesn't have a gradient accumulation facility, but let me know if there is another way of using a larger batch size on a single GPU with the current code.

Thanks for the other suggestions and comments as well; I think I will soon be able to replicate the NegCLIP model with some more effort.

vinid commented 1 year ago

The batch size in the command does not reflect the real batch size, because you need to take into account that we add the hard negative images, plus the hard negative captions for both the original images and the hard negative images. So a batch size of 256 becomes a 512x1024 contrastive matrix.

Adding this extract from the repo that should explain this a little bit better:

Basically, starting from a batch size of 256, you get to a contrastive matrix of 512x1024 
(for the image part we have 256 images + 256 hard images, for the text part we have 
256 captions + 256 captions from the hard images + 256 hard captions + 256 
hard captions from the hard images).

We were actually able to fit a batch size of 256 on an A100 GPU. I think you should still be able to get good results on the ARO benchmark with a 128 batch size; what I expect to be impacted the most is the performance on CIFAR and ImageNet.
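
For illustration, here is a minimal sketch of how the 512x1024 logit matrix described above could be assembled from the six groups of embeddings (all names and shapes below are my assumptions, not code from the repo):

import torch
import torch.nn.functional as F

batch, dim = 256, 512  # illustrative sizes: 256 originals per modality, ViT-B/32 embedding dim

# Assumed inputs: L2-normalized CLIP embeddings for each group (random here, for shapes only).
img          = F.normalize(torch.randn(batch, dim), dim=-1)  # original images
hard_img     = F.normalize(torch.randn(batch, dim), dim=-1)  # hard negative images
txt          = F.normalize(torch.randn(batch, dim), dim=-1)  # captions of the original images
hard_img_txt = F.normalize(torch.randn(batch, dim), dim=-1)  # captions of the hard images
neg_txt      = F.normalize(torch.randn(batch, dim), dim=-1)  # hard negative captions
neg_hard_txt = F.normalize(torch.randn(batch, dim), dim=-1)  # hard captions of the hard images

images = torch.cat([img, hard_img], dim=0)                             # 512 x dim
texts  = torch.cat([txt, hard_img_txt, neg_txt, neg_hard_txt], dim=0)  # 1024 x dim

logit_scale = 100.0  # exp of the clamped logit_scale used by CLIP
logits_per_image = logit_scale * images @ texts.t()
print(logits_per_image.shape)  # torch.Size([512, 1024])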

vinid commented 1 year ago

Oh sorry, I forgot to reply about gradient accumulation: we don't do gradient accumulation (it might not help, because contrastive learning works better with a higher batch size where you make more in-batch comparisons, and accumulation does not add those comparisons).

We actually wanted to implement NegCLIP in a distributed version (so you could spread a lower per-GPU batch size over multiple GPUs and get the same effect as a larger batch size), but we still haven't been able to work on that.

HarmanDotpy commented 1 year ago

Thanks so much for the details, they all make sense and I think I am able to reproduce the NegCLIP model now, almost.

I am getting nearly the same accuracies for CLIP/NegCLIP on attribution/relation using my own pretraining and evaluation code, so the last table of the paper is mostly reproduced for me.

However, I am still seeing some drop in zero-shot ImageNet performance: I am getting ~54% while the original NegCLIP gets around ~60%. I may have to dig into this more.

Just noting some things here that I did for reproduction, following some of the suggestions you gave.

  1. I was earlier not clamping the logit_scale value to log(100), as done by OpenAI as well as OpenCLIP in their repos, so I added that.
  2. A major problem with my implementation was in the distributed "gathering" of tensors. I initially had a naive implementation, and when the tensors were gathered across GPUs, the alignment of the images/hard images, texts/hard texts, etc. got messed up. I corrected that implementation, and after that contrastive learning improved for me. I think something similar (correct gathering of tensors) would also have to be implemented for the OpenCLIP repo to make NegCLIP work correctly; I can help with this if needed, I just need to spend some time understanding the gathering part of the OpenCLIP repo. A rough sketch of what I mean is below this list.
  3. On the evaluation part, I was taking the micro accuracy, but now I take the macro accuracy as done in the paper and in this repo.
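
Here is the rough sketch of the order-preserving gather I mean in point 2 (function and variable names are mine, and the sketch ignores gradient flow through the gathered tensors, which OpenCLIP handles separately):

import torch
import torch.distributed as dist

def all_gather_cat(t):
    # All-gather a tensor from every rank and concatenate along the batch dimension.
    out = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(out, t)
    return torch.cat(out, dim=0)

def build_global_batch(img, hard_img, txt, hard_img_txt, neg_txt, neg_hard_txt):
    # Gather each group separately, then concatenate in a fixed global order.
    # Naively gathering a pre-concatenated [img; hard_img] tensor would interleave
    # the groups across ranks and misalign positives/negatives in the contrastive matrix.
    images = torch.cat([all_gather_cat(img), all_gather_cat(hard_img)], dim=0)
    texts = torch.cat([
        all_gather_cat(txt),
        all_gather_cat(hard_img_txt),
        all_gather_cat(neg_txt),
        all_gather_cat(neg_hard_txt),
    ], dim=0)
    return images, texts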

A question still remains: I noticed that some attributes are removed in the evaluation, so I was just wondering if there was any specific reason to remove these attributes.

Thanks!

vinid commented 1 year ago

I am getting nearly the same accuracies for CLIP/NegCLIP on attribution/relation using my own pretraining and evaluation code, so the last table of the paper is mostly reproduced for me.

That's awesome!

However, I am still seeing some drop in zero-shot ImageNet performance: I am getting ~54% while the original NegCLIP gets around ~60%.

Were you eventually able to increase the batch size? I remember we ran into similar issues that were solved once we were able to increase the batch size (similar issues: the model performs well on ARO but loses some generalization power).

I can help with this if needed, I just need to spend some time understanding the gathering part of the OpenCLIP repo.

I think we will definitely ask you for feedback on this! Thanks so much!

So I was just wondering if there was any specific reason to remove these attributes.

We removed those relations for which we had too few examples to report something significant: some attributes/relations occur only a couple of times, so NegCLIP or CLIP (or the other) could get them right just by chance, and this can significantly shift the macro accuracy. Thus we decided to remove those under a certain threshold.
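
To make this concrete, a small sketch of macro accuracy with a frequency cutoff (the threshold value here is an arbitrary assumption, not the paper's):

from collections import Counter

def macro_accuracy(relations, correct, min_count=10):  # min_count is an assumed threshold
    # Per-relation accuracy averaged over relations, dropping relations that occur
    # too rarely (a couple of lucky guesses would otherwise swing the macro average).
    counts = Counter(relations)
    per_relation = {}
    for rel, n in counts.items():
        if n < min_count:
            continue
        hits = sum(c for r, c in zip(relations, correct) if r == rel)
        per_relation[rel] = hits / n
    return sum(per_relation.values()) / len(per_relation)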

HarmanDotpy commented 1 year ago

Yes, I was able to increase the batch size for my code. Also, thanks for the note on the removal of attributes that only occur a few times.

I had one more question, actually: I was wondering how the models are selected while finetuning. It's written that it's based on COCO retrieval accuracy on the validation set. Is this the same validation set provided in the temp_data directory? And are t2i scores considered while choosing the best model?

vinid commented 1 year ago

No, it should be the original MSCOCO val (still from Karpathy's splits).

and are t2i scores considered while choosing the best model?

sorry, would you be able to elaborate more on this?

HarmanDotpy commented 1 year ago

I see. I am having some confusion about what the Karpathy splits are exactly; I did a bit of searching, but is it possible for you to point me to the correct splits? And if I understand correctly, among the current TSV files, the train file is used to train the model for 5 epochs, while the validation is done not on the current validation TSV file but on another file (the Karpathy split)?

Also, during COCO retrieval you get both text-to-image (t2i) and image-to-text (i2t) retrieval scores during validation. Are the t2i scores considered while evaluating which model is the best among all 5 epochs?

A related question: which epoch gave you the best accuracy among the 5 while finetuning NegCLIP (in case you have the logs)?

vinid commented 1 year ago

The original Karpathy splits should be here. The original MSCOCO had many images in the validation set, and Karpathy's version moved some of the validation into training. See also the paper.
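
For anyone else looking, a small sketch of reading the commonly distributed Karpathy split file (this assumes the usual dataset_coco.json format with a per-image split field of train/val/test/restval; the path is an assumption):

import json

with open("dataset_coco.json") as f:
    data = json.load(f)

splits = {"train": [], "val": [], "test": [], "restval": []}
for image in data["images"]:
    splits[image["split"]].append(image["filename"])

print({name: len(files) for name, files in splits.items()})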

Also, during COCO retrieval you get both text-to-image (t2i) and image-to-text (i2t) retrieval scores during validation. Are the t2i scores considered while evaluating which model is the best among all 5 epochs?

I think we select on val loss but I have to check in with Mert on this.

A related question: which epoch gave you the best accuracy among the 5 while finetuning NegCLIP (in case you have the logs)?

I'll try to dig out the logs, as I don't remember which one we chose, but I am sure we had a CSV in which we stored the performance.

HarmanDotpy commented 1 year ago

Thanks @vinid!

I think we select on val loss but I have to check in with Mert on this.

What is meant by val loss here? Is the loss somehow calculated on the COCO validation dataset (and is the setting different from retrieval)?

I am still slightly stuck on exactly reproducing NegCLIP, so it would be helpful if you or @mertyg could give an idea about the best-model selection strategy, and which epoch gave you the best accuracy among the 5 epochs, if you know.

thanks!

mertyg commented 1 year ago

Hi @HarmanDotpy , thanks for your interest again! Truly appreciate the effort to reproduce the result!!

Apologies for the confusion; we track the retrieval performance over COCO. The last epoch gave us the best performance, both in terms of text-to-image and image-to-text retrieval. The val loss (contrastive loss) also tracked these numbers. Please let me know if this is helpful!

HarmanDotpy commented 1 year ago

Hi @mertyg, this is very helpful information for reproducibility.

Earlier I was able to reproduce NegCLIP on ARO but was having some trouble with zero-shot ImageNet performance decreasing a lot. I actually solved it; it was an issue in my code where I was using slightly different image augmentations compared to the OpenCLIP/NegCLIP repo.

I am now able to reproduce everything using my own code as well as the NegCLIP code. I think it's a good time to close this issue. If I have any other questions, I'll open another issue.

Thanks so much for all the help!

mertyg commented 1 year ago

This is awesome to hear. Thank you so much @HarmanDotpy !

vishaal27 commented 1 year ago

Hi @HarmanDotpy, thanks for your excellent questions, these were very helpful to me as well for reproducing NegCLIP's training. I just had one question. You mentioned in one of your comments that "a major problem with my implementation was in the distributed 'gathering' of tensors"; I assume this means you were trying to train NegCLIP on multiple GPUs? I was just wondering if in the end you used multiple GPUs or trained only on a single GPU. Also, did you exactly replicate the results using these hyperparameters from the NegCLIP repo?

CUDA_VISIBLE_DEVICES=0 python -m training.main \
    --train-data="./mscoco_with_negatives_training.csv" \
    --batch-size=256 \
    --epochs=5 \
    --name="negclip_256_1e-6" \
    --lr=1e-6 \
    --val-data="./mscoco_with_negatives_valid.csv"  \
    --logs="./logs/negCLIP/" \
    --pretrained="openai" \
    --model="ViT-B-32"\
    --workers 14 \
    --warmup 50

HarmanDotpy commented 1 year ago

Hi @vishaal27, sorry for seeing this so late. Yes, I used a multi-GPU version of NegCLIP which I wrote myself (over another CLIP implementation, different from this code, but it should be similar to implement in this code as well). I replicated the results to a good level using the given hyperparameters; I would call it pretty close to the exact results reported (some minor differences were there due to a different implementation). However, I wasn't able to run the model on a single GPU using the given hyperparameters, likely because I was using a 40GB A100 and my hypothesis was that they might be using an 80GB A100 (or there was something else because of which I wasn't able to fit the given batch size on the GPU).

If you are unable to replicate something, feel free to ping here.