raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Question about DenseCLIP for Any Visual Backbone #47

Open needsee opened 9 months ago

needsee commented 9 months ago

Congratulations on your great work! @raoyongming I have some questions about the any-visual-backbone experiments and would like to know more details about them. Could you provide the code for the any-backbone experiments? That would help a lot with understanding. Thanks!

needsee commented 9 months ago

If I use Swin Transformer-T as the image encoder, the output image feature is [B, 768, 16, 12]. Is the attention pooling layer used to map the image features into the embedding space ([B, 512, 16, 12]), which is then used to compute similarity with the text features? Can I replace it with a linear layer?

raoyongming commented 9 months ago

Yes, we use a randomly initialized attention pooling layer to map the image features into the embedding space. It might be okay to use a simpler linear layer, but we haven't tried it in our experiments.
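
Since the any-backbone code is not in the released repo, below is a minimal sketch of what such a randomly initialized attention-pooling projection could look like, written in the spirit of CLIP's `AttentionPool2d` but keeping per-position outputs. The class name `AttnPoolProj` and all hyperparameters (`num_heads=8`, `embed_dim=512`, the dummy shapes) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AttnPoolProj(nn.Module):
    """Randomly initialized attention pooling that maps backbone features
    (B, C_in, H, W) into the CLIP embedding space (B, C_emb, H, W)."""

    def __init__(self, h, w, in_dim=768, embed_dim=512, num_heads=8):
        super().__init__()
        # learnable positional embedding for H*W patch tokens + 1 global token
        self.pos_embed = nn.Parameter(torch.randn(h * w + 1, 1, in_dim) * in_dim ** -0.5)
        self.attn = nn.MultiheadAttention(in_dim, num_heads)
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.flatten(2).permute(2, 0, 1)                    # (H*W, B, C_in)
        x = torch.cat([x.mean(0, keepdim=True), x], dim=0)   # prepend mean (global) token
        x = x + self.pos_embed
        x, _ = self.attn(x, x, x, need_weights=False)        # self-attention over tokens
        x = self.proj(x)                                     # project to embedding dim
        return x[1:].permute(1, 2, 0).reshape(b, -1, h, w)   # drop global token, restore map


# usage: Swin-T stage-4 features -> embedding space -> per-pixel similarity with text features
feat = torch.randn(2, 768, 16, 12)                           # (B, C_in, H, W) from the backbone
pool = AttnPoolProj(h=16, w=12, in_dim=768, embed_dim=512)
img_emb = pool(feat)                                          # (B, 512, 16, 12)
text_emb = torch.randn(2, 20, 512)                            # (B, num_classes, 512) from the text encoder
img_emb = img_emb / img_emb.norm(dim=1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
score_map = torch.einsum('bchw,bkc->bkhw', img_emb, text_emb)  # (B, num_classes, H, W)
```

Replacing the module above with a single `nn.Conv2d(768, 512, kernel_size=1)` would be the "simpler linear layer" variant mentioned in the reply; as noted, that configuration was not tested in the paper.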

needsee commented 9 months ago

> Yes, we use a randomly initialized attention pooling layer to map the image features into the embedding space. It might be okay to use a simpler linear layer, but we haven't tried it in our experiments.

Thanks for your reply. Could you please provide the code for the any-backbone experiments? This is my email: liuliyuan2023@bupt.edu.cn. Thanks.

needsee commented 8 months ago

@raoyongming Hello, when you ran the any-visual-backbone experiments, did you also try an ImageNet pre-trained ViT? I tried using an ImageNet pre-trained ViT and saw no improvement. What do you think the reason might be?

raoyongming commented 8 months ago

Hi, we only ran experiments on the ResNet and Swin backbones reported in the paper.