lezhang7 / Enhance-FineGrained

[CVPR' 2024] Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding

About the performance on ELEVATER #16

Closed hiker-lw closed 4 months ago

hiker-lw commented 4 months ago

Hello, sorry to bother you again. I noticed that CE-CLIP's performance on ELEVATER reported in the paper is 53.2, but in my case it is 44.4 using your provided checkpoint. Since this is a huge gap, I don't know whether my code is wrong or there is some other reason. Would you mind sharing your test code for ELEVATER? Thanks very much!

lezhang7 commented 4 months ago

Hi,

There was a bug in the ELEVATER evaluation in the original arXiv version: we didn't include all of the datasets. After fixing it, we observed a drop in zero-shot image classification performance across all models, including NegCLIP, SVLC, and ours, due to the absence of LoRA.

Instead, we report ImageNet-1k linear probing performance to demonstrate that the visual representation remains robust and retains its original capabilities, following the NegCLIP paper. You can find the test code at https://github.com/LAION-AI/CLIP_benchmark.
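For reference, here is a minimal ImageNet-1k linear-probe sketch in the spirit of that setup, using open_clip plus scikit-learn. The model name, checkpoint path, dataset root, and regression hyperparameters below are placeholders, not our exact configuration:

```python
import numpy as np
import torch
import open_clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import ImageNet

# Placeholder model/checkpoint; substitute the released CE-CLIP weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/ce_clip_checkpoint.pt"
)
model = model.eval().cuda()

@torch.no_grad()
def extract_features(split):
    # Encode frozen image features for a whole split (memory-heavy for
    # the full train split; fine for a sketch).
    dataset = ImageNet(root="/path/to/imagenet", split=split, transform=preprocess)
    loader = DataLoader(dataset, batch_size=256, num_workers=8)
    feats, labels = [], []
    for images, targets in loader:
        f = model.encode_image(images.cuda())
        feats.append(f.cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

train_x, train_y = extract_features("train")
val_x, val_y = extract_features("val")

# Logistic-regression probe on frozen features; C is a tunable placeholder.
clf = LogisticRegression(C=3.16, max_iter=1000)
clf.fit(train_x, train_y)
print("linear-probe top-1:", clf.score(val_x, val_y))
```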

Best regards,

hiker-lw commented 4 months ago

Thanks sincerely for your reply! Would you mind describing the bug in more detail?

lezhang7 commented 4 months ago

We did not include some datasets in the ELEVATER evaluation.

hiker-lw commented 4 months ago

You mean the non-trivial performance drop of negative-text-augmented models like NegCLIP, DAC, CLIP-SVLC, etc. does indeed exist, and there is no bug in the ELEVATER evaluation code?

lezhang7 commented 4 months ago

Yes, all models trained with hard negative texts show a drop in zero-shot image classification performance, but they maintain their performance on the linear probing classification task. Fine-tuning the text encoder causes this; training with LoRA can alleviate it effectively, though.
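To illustrate the LoRA point, here is a minimal, self-contained sketch of a LoRA-wrapped linear layer in plain PyTorch (this is not the code from any of the papers above; in practice you would wrap the text encoder's projection layers and freeze the base weights):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update: W x + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base CLIP weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.02)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def add_lora(module: nn.Module, r: int = 8):
    # Recursively replace every nn.Linear child with a LoRA-wrapped version.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r))
        else:
            add_lora(child, r=r)

# e.g. add_lora(clip_model.transformer)  # text tower only, per the discussion above
```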

hiker-lw commented 4 months ago

But DAC and CLIP-SVLC are also trained with LoRA, and I still observed a big performance drop (-12% and -7%, respectively). I feel this problem is hard to avoid when training with hard negative texts. Anyway, thanks very much~~ If it's convenient, could we add each other on WeChat to discuss further?

hiker-lw commented 4 months ago

I have been working on this task for a year now and still haven't made any significant progress. I sincerely hope to have the opportunity to discuss it further with you, if you don't mind~

lezhang7 commented 4 months ago

Yes, sure! My WeChat ID is Leo723_Z. My hypothesis is that training with hard negative texts biases models toward the image-to-text retrieval task; it's a kind of OOD fine-tuning that makes the model forget its original capabilities. One easy fix is to mix in pretraining data during training, as in https://github.com/wjpoom/SPEC.
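A rough sketch of that fix, mixing a plain pretraining batch into each hard-negative step. Here `model`, `optimizer`, the two loaders, `hard_negative_loss`, and the weight `lam` are all stand-ins for your own setup, not SPEC's exact recipe:

```python
import itertools
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, logit_scale):
    # Symmetric InfoNCE on L2-normalized features, as in standard CLIP training.
    logits = logit_scale * image_feats @ text_feats.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

lam = 0.5  # placeholder mixing weight
pretrain_iter = itertools.cycle(pretrain_loader)  # e.g. LAION/CC3M image-text pairs

for hardneg_batch in hardneg_loader:  # batches augmented with hard negative texts
    images, texts = next(pretrain_iter)
    img_f = F.normalize(model.encode_image(images), dim=-1)
    txt_f = F.normalize(model.encode_text(texts), dim=-1)

    # The pretraining-style loss anchors the model to its original alignment...
    loss = clip_loss(img_f, txt_f, model.logit_scale.exp())
    # ...while the hard-negative objective adds fine-grained discrimination.
    loss = loss + lam * hard_negative_loss(model, hardneg_batch)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```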

hiker-lw commented 4 months ago

Thanks so much! You are really a kind researcher. Best regards~