raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Question about CLIPVisionTransformer #20

Closed aniki-ly closed 2 years ago

aniki-ly commented 2 years ago

Hi, thanks for your work on DenseCLIP.

I have some questions about the CLIPVisionTransformer.


```python
# excerpt from CLIPVisionTransformer.forward (assumes torch and torch.nn.functional as F are imported)
x = self.conv1(x)  # shape = [*, width, grid, grid]
B, C, H, W = x.shape
x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
# prepend the class embedding as an extra token
x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]

pos = self.positional_embedding.to(x.dtype)
cls_pos = pos[0, :] + self.class_embedding.to(x.dtype)  # <-- class_embedding added a second time here
# resize the spatial positional embedding to match the current feature map size
spatial_pos = F.interpolate(pos[1:, ].reshape(1, self.spatial_size, self.spatial_size, C).permute(0, 3, 1, 2), size=(H, W), mode='bilinear')
spatial_pos = spatial_pos.reshape(1, C, H * W).permute(0, 2, 1)
pos = torch.cat([cls_pos.reshape(1, 1, C), spatial_pos], dim=1)
x = x + pos
x = self.ln_pre(x)
x = x.permute(1, 0, 2)  # NLD -> LND
```
Here `x` already contains both the image features and the class embedding, yet `cls_pos` has the class embedding added to it again. This seems to conflict with the original CLIP code; could you explain the reason for this operation?
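
For reference, the corresponding lines in the original CLIP `VisionTransformer.forward` (openai/CLIP, `model.py`) look roughly like this, with the class embedding added only once:

```python
# original CLIP: class embedding prepended once, positional embedding added as-is
x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)
x = x + self.positional_embedding.to(x.dtype)
x = self.ln_pre(x)
```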

raoyongming commented 2 years ago

Hi, thanks for your interest in our work. It seems to be a bug, since the class embedding is added twice. We did not directly use the class embedding for dense prediction tasks, so I don't think the final results are affected significantly (the class embedding could even be removed during fine-tuning).
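
For later readers, a minimal sketch of the fix implied above (same variables as in the snippet; the only change is dropping the duplicate `class_embedding` term):

```python
pos = self.positional_embedding.to(x.dtype)
cls_pos = pos[0, :]  # no second class_embedding add; x already holds it as the first token
spatial_pos = F.interpolate(
    pos[1:, ].reshape(1, self.spatial_size, self.spatial_size, C).permute(0, 3, 1, 2),
    size=(H, W), mode='bilinear')
spatial_pos = spatial_pos.reshape(1, C, H * W).permute(0, 2, 1)
pos = torch.cat([cls_pos.reshape(1, 1, C), spatial_pos], dim=1)
x = x + pos
```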

aniki-ly commented 2 years ago

Thanks!