This repo is the official PyTorch implementation of the paper: CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
There are different operations in modified_clip/model.py and modified_clip/open_model.py, such as img_features[kth] = ln_x in model.py versus img_features[kth] = ln_x - img_features[kth] in open_model.py. Why is there such a difference? I thought the two files should just load different sizes of ViT-based CLIP models through the CLIP and OpenCLIP packages.
In the forward of model.py, are operations such as concatenating fg_text_features.mean(0, True) into text_features, and seg_last[seg_last < seg_last.amax(0, keepdim=True) * 0.2] = 0, used to improve performance? How was the threshold of 0.2 determined?
BTW, this code is simple yet elegant. Thanks for your impressive work.
Thank you for carefully reviewing our code. Regarding the issues you mentioned:
The implementation in open_model.py should be consistent with that in model.py. However, it is worth noting that this minor difference does not affect the overall conclusions: the linear mapping in the last attention layer tends to produce features with opposite characteristics before and after the mapping, and the subtraction operation simply reuses the features from the last layer. As for the experiments with OpenCLIP, they are not discussed in our paper; they are part of the extended experiments conducted in this repository.
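To make the difference concrete, here is a minimal sketch of the two updates (variable names follow the snippets quoted above; the shapes and surrounding code are illustrative, not the exact repository implementation):

```python
import torch

# Toy stand-ins for the k-th layer's post-LayerNorm output and the buffer of
# intermediate image features (shapes are illustrative only).
ln_x = torch.randn(1, 197, 512)
img_features = {0: torch.randn(1, 197, 512)}
kth = 0

# model.py-style update: store the post-LayerNorm features directly.
features_clip = dict(img_features)
features_clip[kth] = ln_x

# open_model.py-style update: subtract the previously stored features, i.e.
# re-use the features from the last layer. Because the linear mapping in the
# last attention layer tends to flip the characteristics of the features,
# both updates lead to the same overall conclusions.
features_openclip = dict(img_features)
features_openclip[kth] = ln_x - features_openclip[kth]
```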
The concatenation for the text features was not deliberately designed; you can remove this operation and check the results, and we do not think it will have much impact. In contrast, the zeroing operation may be necessary in our work, but the threshold is not a sensitive hyperparameter: we set it to 0.2 for all datasets. You can experiment with other reasonable values, but the results are unlikely to change significantly.
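For reference, a minimal, self-contained sketch of the two operations discussed here (shapes, class counts, and variable names are illustrative, not the exact repository code):

```python
import torch

# Illustrative setup: 8 foreground classes, 512-dim text embeddings, and a
# per-class segmentation map of shape (num_classes, H, W).
fg_text_features = torch.randn(8, 512)   # per-class foreground text embeddings
text_features = torch.randn(9, 512)      # e.g. foreground classes + background
seg_last = torch.rand(9, 32, 32)

# Concatenate the mean foreground text embedding as one extra entry; removing
# this step is not expected to change the results much.
text_features = torch.cat([text_features, fg_text_features.mean(0, True)], dim=0)

# For every spatial position, zero out class responses weaker than 20% of the
# strongest class response at that position; 0.2 is used for all datasets and
# nearby values behave similarly.
seg_last[seg_last < seg_last.amax(0, keepdim=True) * 0.2] = 0
```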
Hi, I'm following and have some questions: