raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

question about eos_indx in model.py #4

Closed qiulesun closed 2 years ago

qiulesun commented 2 years ago

Nice work! In the CLIPTextContextEncoder class, why does the line eos_indx = text.argmax(dim=-1) + N2 add N2?

raoyongming commented 2 years ago

Hi, thanks for your interest in our work.

As shown in Figure 3, the input of the text encoder is [<bos>, p1, ..., pN, class name, <eos>]. Following the implementation of CLIP, we can use text.argmax(dim=-1) to get the index of the <eos> in [<bos>, class name, <eos>], since <eos> has the largest token id in CLIP's vocabulary. By adding N2, we obtain the index of the <eos> after inserting the learnable context [p1, ..., pN].
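To make the index shift concrete, here is a minimal, self-contained sketch (not the repository's exact code; all shapes and token positions are hypothetical) of why adding N2 recovers the <eos> position after the context is inserted:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: K classes, prompt length L, N2 learnable context tokens, dim C.
K, L, N2, C = 120, 13, 8, 512
vocab_size = 49408                               # CLIP BPE vocab; <eos> has the largest id

token_embedding = nn.Embedding(vocab_size, C)

text = torch.zeros(K, L, dtype=torch.long)       # tokenized [<bos>, class name, <eos>, pad...]
text[:, 0] = 49406                               # <bos>
text[:, 3] = 49407                               # <eos>, pretending each class name is two tokens

x_text = token_embedding(text)                   # (K, L, C)
contexts = nn.Parameter(torch.randn(1, N2, C))   # learnable context [p1, ..., pN2]

# Insert the context right after <bos>: [<bos>, p1, ..., pN2, class name, <eos>, pad...]
x = torch.cat([x_text[:, 0:1], contexts.expand(K, -1, -1), x_text[:, 1:]], dim=1)

# argmax over the original tokens finds <eos> in [<bos>, class name, <eos>];
# adding N2 shifts that index by the number of context tokens inserted before it.
eos_indx = text.argmax(dim=-1) + N2
text_features = x[torch.arange(K), eos_indx]     # (K, C), one feature per class
```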

qiulesun commented 2 years ago

Thanks for your quick reply. In line 697, the input of the text encoder seems to be changed to [<bos>, class name 1, p1, ..., pN, class name 120, <eos>] instead of [<bos>, p1, ..., pN, class name, <eos>], so the way the index of the <eos> is obtained confuses me.

raoyongming commented 2 years ago

x_text is actually [<bos>, class name, <eos>]. Therefore, in line 697, x_text[:, :, 0:1] is the <bos> and x_text[:, :, 1:] is [class name, <eos>], so the concatenation still yields [<bos>, p1, ..., pN, class name, <eos>] for each class.
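A shape-level sketch of that slicing and concatenation (hypothetical sizes, not the actual line 697):

```python
import torch

# B images, K classes, prompt length L, N2 context tokens, embedding dim C.
B, K, L, N2, C = 2, 120, 13, 8, 512
x_text = torch.randn(B, K, L, C)        # embeddings of [<bos>, class name, <eos>] per class
contexts = torch.randn(1, 1, N2, C)     # learnable context, broadcast over batch and classes

# x_text[:, :, 0:1] is the <bos> embedding and x_text[:, :, 1:] is [class name, <eos>],
# so the concatenation gives [<bos>, p1, ..., pN, class name, <eos>] per class.
x = torch.cat([x_text[:, :, 0:1],
               contexts.expand(B, K, -1, -1),
               x_text[:, :, 1:]], dim=2)
print(x.shape)                          # torch.Size([2, 120, 21, 512])
```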

qiulesun commented 2 years ago

I understand now, thanks again for your explanation. Another question: have you tried inserting the learnable context [p1, ..., pN] at different positions, e.g., [<bos>, class name 1, class name 2, p1, ..., pN, class name 3, ..., class name 120, <eos>]?

raoyongming commented 2 years ago

The idea of learnable context comes from CoOp. Their results (Figure 3) show that different ways of inserting the learnable context do not largely affect the final performance, so we directly use their default setting in our framework.

qiulesun commented 2 years ago

@raoyongming The mIoU that DenseCLIP achieves on ADE20K is better than other SOTA methods, and the results on other datasets may be very good as well. Would you consider releasing mIoU values on other mainstream segmentation datasets, e.g., Pascal Context and Cityscapes, or providing the corresponding config scripts to fine-tune DenseCLIP?

raoyongming commented 2 years ago

Since ADE20K is more challenging than other datasets, we conducted our main experiments on it. We may test our method on more datasets in the future. Since our implementation is based on mmseg, I think it is quite easy to extend it to other datasets by changing the dataset and the number of classes in the config files.
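For illustration, a hypothetical mmseg-style config sketch (not a released DenseCLIP config; the base config paths are assumptions) for switching the same model to Cityscapes would mainly swap the dataset config and the number of classes:

```python
# Hypothetical config sketch: reuse the model and schedule, change dataset and classes.
# DenseCLIP additionally needs the new class-name list for its text prompts.
_base_ = [
    '../_base_/models/denseclip_fpn_r50.py',   # hypothetical base model config
    '../_base_/datasets/cityscapes.py',        # standard mmseg dataset config
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_80k.py',
]

model = dict(
    decode_head=dict(num_classes=19),          # Cityscapes has 19 classes
)
```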

qiulesun commented 2 years ago

@raoyongming Based on mmseg, DenseCLIP can easily be applied to other segmentation datasets. However, the hyperparameters, e.g., the base lr and the learning rates of the backbone, text_encoder, context_decoder, neck and head, have to be carefully tuned to get the desired performance on different datasets.

raoyongming commented 2 years ago

I agree. Since I am currently busy with other projects, I may try it in the future. Please let us know if you get any results.

qiulesun commented 2 years ago

@raoyongming I am trying to apply DenseCLIP to the Pascal Context and Cityscapes datasets and believe it can achieve good performance. Can you give me some advice on the hyperparameters for both datasets, e.g., the base lr and the learning rates of the backbone, text_encoder, context_decoder, neck and head?

raoyongming commented 2 years ago

mmseg uses the same lr config for ADE20K and Cityscapes, so I think you can start from the same configuration on these new datasets.
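As a hedged starting point, per-module learning rates are usually expressed in mmseg through mmcv's paramwise_cfg; the sketch below only illustrates that mechanism, and the concrete values (base lr, lr_mult per module) are placeholders, not the released DenseCLIP settings:

```python
# Illustrative optimizer config (values are placeholders to be tuned per dataset).
optimizer = dict(
    type='AdamW',
    lr=1e-4,                                    # base lr
    weight_decay=1e-4,
    paramwise_cfg=dict(
        custom_keys={
            'backbone': dict(lr_mult=0.1),      # smaller lr for the pretrained image encoder
            'text_encoder': dict(lr_mult=0.0),  # e.g. keep the CLIP text encoder frozen
            'norm': dict(decay_mult=0.0),
        }))
lr_config = dict(policy='poly', power=0.9, min_lr=1e-6, by_epoch=False)
```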