This repo is the official PyTorch implementation of the paper: CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
There are different operations in modified_clip/model.py and modified_clip/open_model.py, such as img_features[kth] = ln_x in model.py versus img_features[kth] = ln_x - img_features[kth] in open_model.py. Why is there such a difference? I thought the two files should just load different sizes of ViT-based CLIP models through the CLIP and OpenCLIP packages.
In the forward of model.py, are operations such as concatenating fg_text_features.mean(0, True) into text_features, and seg_last[seg_last < seg_last.amax(0, keepdim=True) * 0.2] = 0, used to improve performance? How was the threshold of 0.2 determined?
BTW, this code is simple yet elegant. Thanks for your impressive work.
Thank you for carefully reviewing our code. Regarding the issues you mentioned:
The implementation in open_model.py should be consistent with that in model.py. However, it is worth noting that this minor difference does not affect the overall conclusions: the linear mapping in the last attention layer tends to produce features with opposite characteristics before and after the mapping, and the subtraction operation simply reuses the features from the last layer. As for the experiments with OpenCLIP, they are not discussed in our paper; they are part of the extended experiments conducted in this repository.
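To make the difference concrete, here is a minimal sketch of the two updates (variable names follow the snippets quoted above; the shapes and surrounding code are illustrative, not the exact repository implementation):

```python
import torch

# Toy stand-ins for the k-th layer's post-LayerNorm output and the buffer of
# intermediate image features (shapes are illustrative only).
ln_x = torch.randn(1, 197, 512)
img_features = {0: torch.randn(1, 197, 512)}
kth = 0

# model.py-style update: store the post-LayerNorm features directly.
features_clip = dict(img_features)
features_clip[kth] = ln_x

# open_model.py-style update: subtract the previously stored features, i.e.
# re-use the features from the last layer. Because the linear mapping in the
# last attention layer tends to flip the characteristics of the features,
# both updates lead to the same overall conclusions.
features_openclip = dict(img_features)
features_openclip[kth] = ln_x - features_openclip[kth]
```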
The concatenation for the text features was not deliberately designed; you can remove this operation and check the results, and we do not think it will have much impact. In contrast, the zeroing operation may be necessary in our work, but the threshold is not a sensitive hyperparameter: we set it to 0.2 for all datasets. You can experiment with other reasonable values, but the results are unlikely to change significantly.
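For reference, a minimal, self-contained sketch of the two operations discussed here (shapes, class counts, and variable names are illustrative, not the exact repository code):

```python
import torch

# Illustrative setup: 8 foreground classes, 512-dim text embeddings, and a
# per-class segmentation map of shape (num_classes, H, W).
fg_text_features = torch.randn(8, 512)   # per-class foreground text embeddings
text_features = torch.randn(9, 512)      # e.g. foreground classes + background
seg_last = torch.rand(9, 32, 32)

# Concatenate the mean foreground text embedding as one extra entry; removing
# this step is not expected to change the results much.
text_features = torch.cat([text_features, fg_text_features.mean(0, True)], dim=0)

# For every spatial position, zero out class responses weaker than 20% of the
# strongest class response at that position; 0.2 is used for all datasets and
# nearby values behave similarly.
seg_last[seg_last < seg_last.amax(0, keepdim=True) * 0.2] = 0
```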
Hi, I'm following and have some questions: