lorebianchi98 / FG-CLIP

[CBMI2024 Best Paper] Official repository of the paper "Is CLIP the main roadblock for fine-grained open-world perception?".
https://lorebianchi98.github.io/FG-CLIP/

About the implementation of the code #1

Closed Anderson-Wu closed 2 months ago

Anderson-Wu commented 4 months ago

Thank you for this great work; it has helped my research a lot! While tracing the code, I encountered some problems.

https://github.com/lorebianchi98/FG-CLIP/blob/6253040cb7e89a88aebb2328919333524455f6cc/src/train_util.py#L172-L176

  1. Why shuffle the COCO validation set during fine-grained training?
  2. Why is the COCO batch size 4 times smaller than the FG-OVD batch size during the fine-grained training phase?

lorebianchi98 commented 4 months ago

Thank you for your interest in our work!

You raised some valid points. We performed the evaluation on the COCO validation set with less rigor because these results do not influence the final outcomes of our paper; this evaluation primarily served to track the trend of coarse-grained capabilities during fine-grained training. You're correct about the shuffling of the COCO validation set: shuffle should be set to False (an oversight on our part, I will fix it :) ).
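
Concretely, the intended fix is just to disable shuffling on the COCO validation loader. A minimal sketch of the idea (the dataset construction below is a stand-in, not the actual code in `train_util.py`):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real COCO validation dataset is built differently.
coco_val_dataset = TensorDataset(torch.randn(100, 3, 224, 224))

coco_val_loader = DataLoader(
    coco_val_dataset,
    batch_size=16,   # illustrative value
    shuffle=False,   # evaluation order stays fixed; no shuffling for validation
    num_workers=2,
)
```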

Regarding the batch size, we reduced it because evaluating the triplet loss on COCO is significantly more computationally expensive than on the FG-OVD dataset. Specifically, for COCO, we have to compute batch_size x batch_size similarity scores, whereas for the FG-OVD dataset, the number of similarity scores is batch_size x 11 (the size of the vocabulary). Reducing the batch size for COCO evaluation allowed us to use a larger batch size during the FG training ;)
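
To make the cost difference concrete, here is a rough sketch of the two similarity computations (sizes are illustrative, not our training configuration; this is not the repository's actual evaluation code):

```python
import torch
import torch.nn.functional as F

batch_size, dim, vocab_size = 64, 512, 11  # illustrative sizes; vocab of 11 captions per image

# COCO-style evaluation: every image embedding is compared against every
# caption embedding in the batch -> batch_size x batch_size scores.
img_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
txt_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
coco_scores = img_emb @ txt_emb.T  # shape: (batch_size, batch_size)

# FG-OVD-style evaluation: each image is compared only against its own small
# vocabulary of 11 captions -> batch_size x 11 scores.
vocab_emb = F.normalize(torch.randn(batch_size, vocab_size, dim), dim=-1)
fgovd_scores = torch.einsum('bd,bvd->bv', img_emb, vocab_emb)  # (batch_size, 11)

print(coco_scores.shape, fgovd_scores.shape)
```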

Anderson-Wu commented 4 months ago

Thank you for your responses!

I want to merge the two training steps into one, with the training set composed of the training sets of both steps and the validation set composed of both validation sets, so I have one more question.

In the warmup step the images come from COCO 2014, while in the fine-grained step they come from COCO 2017. Do any images from the training set of the fine-grained step appear in the validation set of the warmup step, or, vice versa, do any images from the validation set of the fine-grained step appear in the training set of the warmup step?

lorebianchi98 commented 4 months ago

Yes, there are overlapping images between the two steps due to an overlap between the images in the COCO 2014 validation set and the COCO 2017 training set, and vice versa. However, since the two steps use different annotations (the original COCO captions during the warmup step and the FG-OVD captions during the fine-grained step), we did not consider this overlap to be problematic.
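
If you do merge the two steps and want to quantify or filter out that overlap, comparing image ids across the COCO annotation files is enough, since ids are consistent between the 2014 and 2017 releases. A quick sketch (the paths assume the standard COCO annotation layout; adjust to your setup):

```python
import json

# Assumed standard COCO annotation file names and locations.
with open('annotations/instances_val2014.json') as f:
    val2014_ids = {img['id'] for img in json.load(f)['images']}
with open('annotations/instances_train2017.json') as f:
    train2017_ids = {img['id'] for img in json.load(f)['images']}

overlap = val2014_ids & train2017_ids
print(f'{len(overlap)} images of the 2014 val split also appear in the 2017 train split')
```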

If you have any further questions or need more clarification, feel free to ask!

Anderson-Wu commented 2 months ago

Hello,

I have one more point in the code that needs clarification: should the sigmoid here be replaced with cosine similarity?

https://github.com/lorebianchi98/FG-CLIP/blob/6253040cb7e89a88aebb2328919333524455f6cc/src/model.py#L115

lorebianchi98 commented 2 months ago

Yes, you're right! This is a small mistake I introduced when trying to "beautify" the code before releasing it to the public. The experiments in the paper were indeed performed using cosine similarity. I'll make sure to fix this. Thanks for bringing this to my attention!
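
Until the fix is pushed, the intended behaviour at that line is simply the cosine similarity between the image and text embeddings. A minimal sketch of the computation (the function and tensor names here are placeholders, not the actual ones in `src/model.py`):

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_similarity(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every image/text embedding pair.

    Stand-in for the score used around src/model.py#L115; only the use of
    cosine similarity (rather than sigmoid) is confirmed above.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return image_emb @ text_emb.T

# Example usage with random embeddings.
scores = pairwise_cosine_similarity(torch.randn(4, 512), torch.randn(4, 512))
print(scores.shape)  # torch.Size([4, 4])
```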