
[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

What are the different context_length settings based on? What is the meaning of the separation? #40

Closed GanPeixin closed 1 year ago

GanPeixin commented 1 year ago

In denseclip.py, there is the following line: "context_length = self.text_encoder.context_length - self.context_length". Can you explain the role of the different contexts? And why do we need the subtraction to compute the context_length of self.contexts?

raoyongming commented 1 year ago

Hi, thanks for your interest in our work. Here self.contexts represents the learnable contexts. As shown in Fig. 3 of our paper, the full context consists of several learnable text embeddings (p1, ..., pN) followed by the fixed class names. self.text_encoder.context_length is the overall length of the text embeddings and self.context_length is the length of the fixed embeddings, so the subtraction gives the number of learnable embeddings.
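A minimal sketch of this length split (PyTorch, with illustrative values and variable names, not the exact code in denseclip.py):

```python
import torch
import torch.nn as nn

# self.text_encoder.context_length: total length of the token sequence fed to the text encoder
# self.context_length: length reserved for the tokenized class names
total_context_length = 13   # example value
fixed_context_length = 5    # example value
learnable_length = total_context_length - fixed_context_length  # -> 8 learnable embeddings

embed_dim = 512  # CLIP text embedding width (assumed)
# Learnable prompt embeddings (p1, ..., pN), shared across all classes
contexts = nn.Parameter(torch.randn(1, learnable_length, embed_dim) * 0.02)
print(contexts.shape)  # torch.Size([1, 8, 512])
```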

GanPeixin commented 1 year ago

Thank you for your answer! I still have some questions:

1. The learnable contexts seem to share the same idea as CoOp, but there are some small differences between your method and CoOp, for example in the following code from 'class CLIPTextContextEncoder' in 'models.py'. I would like to know the reason: ① CoOp first concatenates the learnable text (X X X X) and the fixed text (class names) to get a whole prompt like [X X X X ...class names...], and then tokenizes and embeds it to get the overall context. Your method seems to first get the learnable text embeddings and the fixed text embeddings separately, and then concatenate them. What is the impact of the different order of concatenation and embedding? ② Your work also seems to apply a transformer to the text embedding 'x'. What is it used for?

2. I notice that your context_length is usually set to 5, 8, or 13, while other methods usually use 77. Why is there such a difference in this setting?

raoyongming commented 1 year ago

Hi @GanPeixin,

  1. We actually tried both implementations in our experiments and didn't see significant differences in performance between them. The main advantage of our final implementation is that the text encoder can be completely removed during inference: we only need to save the text embeddings of those class names and then concatenate them with the learnable prompts, without using the large Transformer encoder from CLIP (see the sketch below). The small extra Transformer is used to refine the text features for our new tasks.

  2. We found that class names in our tasks are usually short. Therefore, we simply reduce the max length of the texts to save GPU memory (texts are always padded to the max length, e.g. 77, before being fed into the CLIP text encoder). This change does not degrade the final performance while saving a lot of memory and computation.
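To make point 1 concrete, here is a hedged sketch of the "embed first, then concatenate" idea; the names (class_name_embeddings, learnable_prompts, prompt_refiner) and the generic encoder layer are illustrative, not the actual modules in models.py:

```python
import torch
import torch.nn as nn

num_classes, name_len, prompt_len, dim = 20, 5, 8, 512

# Token embeddings of the tokenized class names. These can be computed once and
# cached, so the large CLIP text encoder is not needed at inference time.
class_name_embeddings = torch.randn(num_classes, name_len, dim)

# Learnable prompt embeddings, shared by all classes and broadcast along the class axis.
learnable_prompts = nn.Parameter(torch.randn(1, prompt_len, dim) * 0.02)

# Concatenate the learnable prompts with the cached class-name embeddings.
full_prompts = torch.cat(
    [learnable_prompts.expand(num_classes, -1, -1), class_name_embeddings], dim=1
)  # (num_classes, prompt_len + name_len, dim)

# A small extra Transformer refines the text features for the new dense-prediction task
# (sketched here as one generic encoder layer; the actual refinement module may differ).
prompt_refiner = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=1
)
text_features = prompt_refiner(full_prompts)
print(text_features.shape)  # torch.Size([20, 13, 512])
```

And for point 2, a small illustration of the padding behaviour, assuming the openai/CLIP package (import clip) is installed; the tokenizer pads every prompt to a fixed context_length, so shortening it shrinks the sequence the text encoder has to process:

```python
import clip

class_names = ["person", "bicycle", "traffic light"]

default_tokens = clip.tokenize(class_names)                   # shape (3, 77)
short_tokens = clip.tokenize(class_names, context_length=13)  # shape (3, 13)

print(default_tokens.shape, short_tokens.shape)
```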

GanPeixin commented 1 year ago

Thank you for your answer! You really did a lot of work on the network structure.