Hi,
There was a bug in our ELEVATER testing in the original arXiv version, where we did not include all of the datasets. As a result, we observed a drop in zero-shot image classification performance across all models, including NegCLIP, SVLC, and ours, which we attribute to the absence of LoRA.
Instead, we report ImageNet-1k linear probing performance to demonstrate that the visual representation is robust and retains its original capabilities, following the NegCLIP paper. You can find the test code at GitHub - LAION-AI/CLIP_benchmark.
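For reference, below is a rough sketch of the kind of linear probe we mean (my own illustration rather than the exact evaluation script; the backbone name, data paths, and regularization strength are placeholders):

```python
# Minimal linear-probe sketch (illustrative, not the exact evaluation script).
# Assumes open_clip, torchvision, and scikit-learn are installed, and that
# ImageNet-1k train/val splits exist at the placeholder paths below.
import numpy as np
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.to(device).eval()

@torch.no_grad()
def extract_features(root):
    loader = DataLoader(ImageFolder(root, transform=preprocess), batch_size=256, num_workers=8)
    feats, labels = [], []
    for images, targets in loader:
        f = model.encode_image(images.to(device))
        f = f / f.norm(dim=-1, keepdim=True)  # L2-normalize the frozen features
        feats.append(f.cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Placeholder paths for the ImageNet-1k splits.
train_x, train_y = extract_features("/path/to/imagenet/train")
val_x, val_y = extract_features("/path/to/imagenet/val")

# Logistic-regression probe on the frozen features, as in the CLIP-style protocol.
clf = LogisticRegression(max_iter=1000, C=3.16)  # C would normally be tuned on a validation split
clf.fit(train_x, train_y)
print("linear-probe top-1:", clf.score(val_x, val_y))
```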
Best regards,
Sincere thanks for your reply! Would you mind sharing more details about the bug?
We did not include some of the datasets during the ELEVATER evaluation.
You mean that the non-trivial performance drop of negative-text-augmented models like NegCLIP, DAC, CLIP-SVLC, etc. does indeed exist, and there is no bug in the ELEVATER evaluation code?
Yes, all models trained with hard negative text generation show a drop in zero-shot image classification performance, but they maintain their performance on the linear probing classification task. The drop is tied to fine-tuning the text encoder; training with LoRA can alleviate it effectively, though.
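If you want to try the LoRA route yourself, something like the following peft sketch is what I have in mind (the target modules, rank, and other hyperparameters are my guesses for illustration, not settings from any of the papers):

```python
# Rough LoRA sketch for the text encoder (illustrative; module names, rank,
# and other hyperparameters are assumptions, not any paper's settings).
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# The regex restricts adapters to the text tower's attention projections,
# leaving the vision tower and all original weights frozen.
lora_cfg = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=r"text_model\.encoder\.layers\.\d+\.self_attn\.(q_proj|v_proj)",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```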
But DAC and CLIP-SVLC are also trained with LoRA, and I still observed large performance drops (-12% and -7%, respectively). I feel this problem is hard to resolve without giving up hard negative texts. Anyway, thank you very much~~ If it's convenient, could we add each other on WeChat to communicate?
I have been working on this task for a year now and still have not made any significant progress. I sincerely hope to have the opportunity to discuss it further with you if you don't mind~
Yes, sure! My WeChat ID is Leo723_Z. My hypothesis is that training with hard negative texts biases the model towards the image-to-text retrieval task, and it is a kind of out-of-distribution fine-tuning that makes the model forget its original capabilities. One easy fix is to mix the training data with pretraining data, as in https://github.com/wjpoom/SPEC .
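Concretely, the mixing I have in mind looks roughly like this (a pure sketch: the loaders, the loss helper, and the mixing weight are all placeholders):

```python
# Sketch of the data-mixing idea (placeholders throughout; `hard_neg_loader`,
# `pretrain_loader`, and `clip_contrastive_loss` are assumed to exist).
import itertools

def mixed_training_epoch(model, optimizer, hard_neg_loader, pretrain_loader,
                         clip_contrastive_loss, pretrain_weight=1.0):
    """Alternate hard-negative batches with pretraining-style batches so the
    model keeps seeing in-distribution image-text pairs while it learns
    compositionality from the hard negatives."""
    for hard_batch, pretrain_batch in zip(hard_neg_loader, itertools.cycle(pretrain_loader)):
        optimizer.zero_grad()
        loss = clip_contrastive_loss(model, hard_batch)  # fine-tuning objective with hard negatives
        loss = loss + pretrain_weight * clip_contrastive_loss(model, pretrain_batch)  # replay of pretraining data
        loss.backward()
        optimizer.step()
```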
Thanks so much! You are really a kind researcher, best regards~
Hi @hiker-lw, I recently came across this thread and wanted to share my experience with similar experiments.
In my case, I also noticed that fine-tuning CLIP with hard negative captions tends to cause a drop in zero-shot classification performance (measured on ELEVATER) and image-to-text (I2T) retrieval scores, particularly when the model is fine-tuned on datasets other than COCO. (The COCO training split provided by the ARO paper authors seems to include some images from the validation set, which I believe needs to be corrected.)
You can find a brief report of my findings in Table 5 on page 14, where I evaluated fine-tuned models on compositional reasoning benchmarks, ELEVATER, and image-text retrieval using the respective checkpoints.
This challenge inspired my recent work, FSC-CLIP, which aims to preserve performance in areas beyond compositional reasoning. Specifically, incorporating focal loss and label smoothing may help achieve better trade-offs.
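To give a rough idea, the combination looks something like the sketch below (a simplified illustration of the two ingredients, not FSC-CLIP's actual implementation; the logits layout and hyperparameters are assumptions):

```python
# Hedged sketch of focal loss + label smoothing on image-to-text logits
# (my own simplification, not FSC-CLIP's actual implementation).
import torch
import torch.nn.functional as F

def focal_smoothed_contrastive_loss(logits, gamma=2.0, smoothing=0.1):
    """`logits`: [batch, batch(+num_hard_negatives)] similarity matrix whose
    first `batch` columns contain the matching texts, so position i is the
    positive for row i. Label smoothing softens the one-hot targets; the
    focal term down-weights already-confident candidates."""
    batch_size, num_candidates = logits.shape
    targets = torch.arange(batch_size, device=logits.device)

    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Smoothed one-hot targets spread over all candidates (incl. hard negatives).
    smooth_targets = torch.full_like(logits, smoothing / (num_candidates - 1))
    smooth_targets.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)

    # Focal modulation: (1 - p)^gamma per candidate.
    focal_weight = (1.0 - probs) ** gamma
    loss = -(focal_weight * smooth_targets * log_probs).sum(dim=-1)
    return loss.mean()
```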
P.S. Fine-tuning with LoRA may not offer a complete solution, as it seems to trade off between compositionality and other tasks. In my initial trials, a higher learning rate led to improvements in compositionality but resulted in less preservation of performance on other tasks.
I hope my trial and error is helpful to your research. Feel free to reach out if you would like any further information.
Yes, it is really great work! Definitely worth reading. See you in Miami, @ytaek-oh!
Thank you for your recommendation! I'm really impressed by your work on compositional reasoning, CE-CLIP and VisMin. I look forward to connecting with you in Miami!
Hello, sorry to bother you again. I noticed that CE-CLIP's performance on ELEVATER reported in the paper is 53.2, but in my case it is 44.4 using your provided checkpoint. Since this is a huge gap, I don't know whether my code is wrong or there is some other reason. Would you mind sharing your ELEVATER test code? Thanks very much!