Nice work! For the CLIPTextContextEncoder class, why does the line eos_indx = text.argmax(dim=-1) + N2 add N2?
Hi, thanks for your interest in our work.
As shown in Figure 3, the input of the text encoder is [<bos>, p1, ..., pN, class name, <eos>]. We can use text.argmax(dim=-1) to get the index of the <eos> in [<bos>, class name, <eos>], following the implementation of CLIP. By adding N2 (the number of inserted learnable context tokens), we obtain the index of the <eos> after inserting the learnable context [p1, ..., pN].
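To make the indexing concrete, here is a minimal sketch of the idea (not the repository's exact code). It assumes the openai-clip package and an illustrative context length:

```python
import clip

# CLIP's tokenizer assigns the <eos> token the largest id in each sequence,
# so argmax over the token ids recovers the position of <eos> in
# [<bos>, class name, <eos>].
text = clip.tokenize(["a photo of a dog"])   # shape (1, 77)
eos_idx = text.argmax(dim=-1)                # index of <eos> before inserting context

# After inserting N2 learnable context tokens right after <bos>, every token
# from the class name onward shifts right by N2 positions.
N2 = 8                                       # illustrative context length (assumption)
eos_idx_with_context = eos_idx + N2          # index of <eos> in [<bos>, p1..pN2, class name, <eos>]
```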
Thanks for your quick reply.
In line 697, the input of the text encoder is changed to [bos, class name1, p1, ..., pN, class name 120, eos] instead of [bos, p1, ..., pN, class name, eos], so the way the index of the eos is obtained confuses me.
x_text actually is [<bos>, class name, <eos>]. Therefore, in line 697, x_text[:, :, 0:1] is the <bos> and x_text[:, :, 1:] is [class name, <eos>].
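For readers following along, below is a hedged sketch of that concatenation; the tensor shapes and sizes are illustrative assumptions, not values copied from the repository:

```python
import torch

B, K, N1, N2, C = 2, 150, 13, 8, 512                     # batch, classes, text length, context length, embed dim (illustrative)
x_text = torch.randn(B, K, N1, C)                        # token embeddings of [<bos>, class name, <eos>] per class
context = torch.randn(B, 1, N2, C).expand(B, K, N2, C)   # learnable context shared across classes

# Insert the learnable context right after <bos>:
# [<bos>] + [p1, ..., pN2] + [class name, <eos>]
x = torch.cat([x_text[:, :, 0:1], context, x_text[:, :, 1:]], dim=2)
assert x.shape == (B, K, N1 + N2, C)
```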
I understand now. Thanks again for your explanation.
Another question: have you tried inserting the learnable context [p1, ..., pN] at different positions, e.g., [bos, class name1, class name2, p1, ..., pN, class name3, ..., class name 120, eos]?
The idea of the learnable context is from CoOp. Their results (Figure 3) show that different ways of inserting the learnable context may not largely affect the final performance. Therefore, we directly use their default setting in our framework.
@raoyongming The mIoU on ADE20K achieved by DenseCLIP is better than other SOTA methods, and the results on other datasets may be very good as well. Do you consider releasing mIoU values on other mainstream segmentation datasets, e.g., Pascal Context and Cityscapes, or providing the corresponding config scripts to fine-tune DenseCLIP?
As ADE20K is more challenging than other datasets, we conducted our main experiments on it. We may test our method on more datasets in the future. Since our implementation is based on mmseg, I think it is quite easy to extend it to other datasets by changing the dataset and the number of classes in the config files.
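For example, a minimal config change for Cityscapes might look like the sketch below. The base-config paths follow common mmseg conventions and are assumptions, not necessarily the exact files in the DenseCLIP repo; only the class count (19 for Cityscapes vs. 150 for ADE20K) is fixed by the dataset:

```python
# Hypothetical mmseg-style config sketch for swapping the dataset.
_base_ = [
    '../_base_/datasets/cityscapes.py',      # swap in the target dataset
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_80k.py',
]

model = dict(
    decode_head=dict(num_classes=19),        # Cityscapes: 19 classes (ADE20K: 150)
    # The per-class text prompts also need to cover the new label set,
    # however the implementation exposes the class names (assumption).
)
```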
@raoyongming Based on mmseg, DenseCLIP can be easily applied to other segmentation datasets. However, the hyperparameters, e.g., the base lr and the lrs of the backbone, text_encoder, context_decoder, neck, and head, have to be carefully tuned to get the desired performance on different datasets.
I agree. Since I am currently busy with other projects, I may try it in the future. Please tell us if you get any results.
@raoyongming I am trying to apply DenseCLIP to the Pascal Context and Cityscapes datasets and believe that it can achieve good performance. Regarding the hyperparameters for both datasets, e.g., the base lr and the lrs of the backbone, text_encoder, context_decoder, neck, and head, can you give me some advice?
mmseg uses the same lr config for ADE and Cityscapes, so I think you can start by using the same configurations on these new datasets.
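As a starting point, one could carry over the ADE20K optimizer settings and control the per-module learning rates with mmseg/mmcv's paramwise_cfg; the concrete values below are assumptions to be tuned, not reported settings:

```python
optimizer = dict(
    type='AdamW',
    lr=1e-4,                                  # base lr (assumed; tune per dataset)
    weight_decay=1e-4,
    paramwise_cfg=dict(custom_keys={
        'backbone': dict(lr_mult=0.1),        # CLIP-pretrained image encoder: smaller lr
        'text_encoder': dict(lr_mult=0.0),    # e.g., keep the text encoder frozen (assumption)
        'norm': dict(decay_mult=0.0),         # no weight decay on normalization layers
    }))
```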