raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

question about any backbone experiments on ADE20K segmentation #7

Closed · wanglixilinx closed this 2 years ago

wanglixilinx commented 2 years ago

Hi @raoyongming, thanks very much for your great work. I have some questions about the any-backbone experiments on ADE20K segmentation in Table 5. For models without CLIP pre-training, e.g., ResNet-18 and Swin Transformer-T/S, I notice the improvement on ADE20K is less significant than for RN50. Do you compute the visual-text feature interaction directly, or did you use any other tricks? Thanks!

raoyongming commented 2 years ago

We add an attention pool layer to the last stage of the backbone to obtain the visual embedding, following the implementation of CLIP. The training details are similar to those of the DenseCLIP models in Table 1.
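In case it helps, here is a minimal sketch of what a CLIP-style attention pool over the last-stage feature map could look like in PyTorch. The class name, `spatial_size`, and the use of `nn.MultiheadAttention` are illustrative simplifications, not necessarily the exact code in this repo:

```python
import torch
import torch.nn as nn


class AttentionPool2d(nn.Module):
    """CLIP-style attention pooling: a mean-pooled query token attends over
    all spatial positions of the final feature map (illustrative sketch)."""

    def __init__(self, spatial_size: int, embed_dim: int, num_heads: int, output_dim: int = None):
        super().__init__()
        # learnable positional embedding for HW spatial tokens plus the pooled token
        self.positional_embedding = nn.Parameter(
            torch.randn(spatial_size ** 2 + 1, embed_dim) / embed_dim ** 0.5)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.proj = nn.Linear(embed_dim, output_dim or embed_dim)

    def forward(self, x):  # x: (N, C, H, W) feature map from the last backbone stage
        x = x.flatten(start_dim=2).permute(2, 0, 1)            # (HW, N, C)
        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # prepend mean token
        x = x + self.positional_embedding[:, None, :]
        pooled, _ = self.attn(x[:1], x, x)                      # query = mean token
        return self.proj(pooled.squeeze(0))                     # (N, output_dim)
```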

wanglixilinx commented 2 years ago

I'm curious: is the attention pool you use taken directly from the pre-trained RN50/RN101? Apart from the attention pool, are the other layers (e.g., layer1~layer4) randomly initialized, and then trained end-to-end within the DenseCLIP framework?

wanglixilinx commented 2 years ago

If convenient, could you leave your WeChat ID so I can ask you some questions?

raoyongming commented 2 years ago

The new attention pool is randomly initialized. We use the ImageNet pre-trained weights to initialize the backbone (i.e., the stem and layer1~layer4). In this experiment, we want to show that DenseCLIP is also useful for ImageNet pre-trained models. Therefore, we didn't use any CLIP weights in the visual encoder (ResNet/Swin + attention pool).
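A minimal sketch of the initialization described above, assuming a backbone class that keeps torchvision's ResNet parameter names and adds the new attention pool (both the class name and the exact key matching are hypothetical):

```python
import torchvision

# Illustrative: ResNet stem + layer1~layer4 + a new AttentionPool2d head.
backbone = ResNetWithAttnPool()  # hypothetical wrapper class

# Load ImageNet weights; parameters that have no counterpart in the
# checkpoint (the new attention pool) keep their random initialization.
imagenet_state = torchvision.models.resnet50(weights="IMAGENET1K_V1").state_dict()
missing, unexpected = backbone.load_state_dict(imagenet_state, strict=False)
# `missing` lists the attention-pool parameters (left random);
# `unexpected` covers the ImageNet classifier head (fc), which is discarded.
```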

raoyongming commented 2 years ago

> If convenient, could you leave your WeChat ID so I can ask you some questions?

@wanglixilinx My WeChat ID is raoyongming95.