raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Questions about the architecture #21

Closed by xiaoachen98 2 years ago

xiaoachen98 commented 2 years ago

The paper reports results only with Semantic FPN.

Have you run any experiments with dilated-backbone methods (e.g. DeepLabV3+)?

Did you avoid dilated-backbone methods to keep FLOPs low, or because you found that the results were not good?

raoyongming commented 2 years ago

Hi, thanks for your interest in our work. We didn't test dilated backbones because we wanted to keep the pre-trained CLIP backbones unchanged. Our method relies heavily on the pre-trained correlation between visual and text embeddings, so converting the backbone to a dilated version may make the initial pixel-text score maps inaccurate.
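For readers unfamiliar with the term, a pixel-text score map can be sketched as the cosine similarity between each pixel's visual embedding and each class's text embedding. The snippet below is only a toy illustration in plain Python, not the repository's code: the function names `l2_normalize` and `pixel_text_score_map`, the nested-list layout, and the 2-dimensional embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pixel_text_score_map(pixel_feats, text_embeds):
    """Cosine similarity between every pixel embedding and every text embedding.

    pixel_feats: H x W x C nested lists (hypothetical layout for this sketch).
    text_embeds: K x C nested lists, one row per class prompt.
    Returns an H x W x K nested list of scores in [-1, 1].
    """
    text_n = [l2_normalize(t) for t in text_embeds]
    score_map = []
    for row in pixel_feats:
        score_row = []
        for pixel in row:
            p = l2_normalize(pixel)
            # One score per class: dot product of the two unit vectors.
            score_row.append([sum(a * b for a, b in zip(p, t)) for t in text_n])
        score_map.append(score_row)
    return score_map


# Toy 2 x 2 feature map with C = 2 channels and K = 2 classes.
pixel_feats = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 3.0], [0.0, 1.0]],
]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
scores = pixel_text_score_map(pixel_feats, text_embeds)
```

The scores are meaningful only while the pixel embeddings stay in the same space as the text embeddings, which is the concern raised above: a backbone change that shifts the visual features would shift every entry of the map.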

xiaoachen98 commented 2 years ago

Thanks for your quick reply! I will try a dilated backbone and see how it affects the results.