More details about DINOV2 as backbone

syp2ysy / VRP-SAM

[CVPR 2024] Official implementation of "VRP-SAM: SAM with Visual Reference Prompt"

MIT License

More details about DINOV2 as backbone #12

Open wyxogo opened 2 months ago

wyxogo commented 2 months ago

Thanks for your work. Could you provide more details on how DINOv2 is implemented as a backbone, specifically how the blocks are grouped into layers and how the features are concatenated?

syp2ysy commented 1 month ago

self.layer1 = nn.Sequential(self.dinov2.blocks[:8])
self.layer2 = nn.Sequential(self.dinov2.blocks[8:12])
self.layer3 = nn.Sequential(self.dinov2.blocks[12:21])
self.layer4 = nn.Sequential(self.dinov2.blocks[21:24])

wyxogo commented 1 month ago

Is this the setting for DINOv2-large? The base model has 12 blocks and the large model has 24 blocks.

syp2ysy commented 1 month ago

Yes, this is the setting for DINOv2-large. If you use DINOv2-base, you can set:

self.layer1 = nn.Sequential(self.dinov2.blocks[:4])
self.layer2 = nn.Sequential(self.dinov2.blocks[4:6])
self.layer3 = nn.Sequential(self.dinov2.blocks[6:10])
self.layer4 = nn.Sequential(self.dinov2.blocks[10:12])
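The two block groupings quoted in this thread can be checked with a small framework-free sketch. It only records the stage end indices given above (8, 12, 21, 24 for large; 4, 6, 10, 12 for base) and verifies that they partition the block list. The names `STAGE_ENDS`, `stage_slices`, and `split_blocks` are hypothetical helpers introduced for illustration; they are not part of VRP-SAM or DINOv2.

```python
# Stage end indices (exclusive) as quoted in this thread.
STAGE_ENDS = {
    "large": (8, 12, 21, 24),  # DINOv2-large, 24 blocks
    "base": (4, 6, 10, 12),    # DINOv2-base, 12 blocks
}


def stage_slices(variant):
    """Return four slices that partition the block list for `variant`."""
    ends = STAGE_ENDS[variant]
    starts = (0,) + ends[:-1]
    return [slice(a, b) for a, b in zip(starts, ends)]


def split_blocks(blocks, variant):
    """Split any sequence of blocks into the four stages (hypothetical helper)."""
    blocks = list(blocks)
    slices = stage_slices(variant)
    expected = slices[-1].stop
    if expected != len(blocks):
        raise ValueError(f"{variant} expects {expected} blocks, got {len(blocks)}")
    return [blocks[s] for s in slices]


if __name__ == "__main__":
    # Integers stand in for transformer blocks here.
    for variant, depth in (("large", 24), ("base", 12)):
        stages = split_blocks(range(depth), variant)
        print(variant, [len(stage) for stage in stages])
    # large [8, 4, 9, 3]
    # base [4, 2, 4, 2]
```

Each returned stage could then be wrapped as `nn.Sequential(*stage)`. The unpacking with `*` matters when `blocks` is an `nn.ModuleList`: passing the sliced list directly to `nn.Sequential` would register it as a single child with no `forward` method. If `blocks` is already an `nn.Sequential`, slicing it directly as in the snippets above also works. Which case applies depends on the DINOv2 implementation being loaded, which the thread does not state.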

syp2ysy commented 1 month ago

self.layer0 = self.dinov2.patch_embed
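To illustrate how `layer0` and the four block groups might fit together, here is a minimal sketch using toy stand-in modules instead of real DINOv2 weights. Everything in it is an assumption for illustration: `StagedBackbone`, `ToyPatchEmbed`, the `nn.Linear` stand-in blocks, and the choice to return one feature map per stage are not taken from the VRP-SAM code, and the thread does not say how the per-stage features are actually combined. Note also that in the public DINOv2 ViT code the class token and positional embeddings are added after `patch_embed` inside the model's own forward pass, so a wrapper that calls only `patch_embed` would need to handle them separately; this sketch skips that step.

```python
import torch
import torch.nn as nn


class ToyPatchEmbed(nn.Module):
    """Stand-in for a ViT patch embedding: image -> (B, N, C) tokens."""

    def __init__(self, patch_size=14, dim=32):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, h, w) -> (B, h*w, C), row-major over the patch grid
        return self.proj(x).flatten(2).transpose(1, 2)


class StagedBackbone(nn.Module):
    """Hypothetical wrapper: patch embedding plus four groups of blocks."""

    def __init__(self, patch_embed, blocks, stage_ends):
        super().__init__()
        self.layer0 = patch_embed
        blocks = list(blocks)
        starts = (0,) + tuple(stage_ends[:-1])
        # Unpack with * so each stage is a callable nn.Sequential.
        self.stages = nn.ModuleList(
            [nn.Sequential(*blocks[a:b]) for a, b in zip(starts, stage_ends)]
        )

    def forward(self, image):
        tokens = self.layer0(image)
        feats = []
        for stage in self.stages:
            tokens = stage(tokens)
            feats.append(tokens)  # one (B, N, C) feature per stage
        return feats


if __name__ == "__main__":
    dim = 32
    toy_blocks = [nn.Linear(dim, dim) for _ in range(12)]  # stand-in blocks
    model = StagedBackbone(ToyPatchEmbed(14, dim), toy_blocks, (4, 6, 10, 12))
    feats = model(torch.randn(2, 3, 56, 56))
    print([tuple(f.shape) for f in feats])
```

With a 56x56 input and patch size 14, each stage output has 16 tokens. A real setup would pass the loaded model's `patch_embed` and `blocks` in place of the toy modules.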

mioadxll commented 1 month ago

I want to know how to handle the masks. Is the following correct?

support_mask = F.interpolate(support_mask_ori.unsqueeze(1).float(), size=img_size // patch_size, mode='nearest').flatten(1)[:, :, None]

To follow the same configuration and better reproduce the reported performance, I would greatly appreciate it if you could provide the corresponding code.
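The shapes produced by the mask expression above can be traced with a short sketch. It reproduces the same steps (add a channel dimension, nearest-neighbour resize to the patch grid, flatten, add a trailing dimension) on a toy mask. The function name `mask_to_patch_tokens` and the toy sizes are assumptions for illustration; whether this matches the authors' configuration is not confirmed anywhere in this thread.

```python
import torch
import torch.nn.functional as F


def mask_to_patch_tokens(mask, img_size, patch_size):
    """Downsample a (B, H, W) binary mask to one value per patch: (B, N, 1)."""
    grid = img_size // patch_size  # patches per side
    mask = F.interpolate(
        mask.unsqueeze(1).float(), size=(grid, grid), mode="nearest"
    )  # (B, 1, grid, grid)
    return mask.flatten(1).unsqueeze(-1)  # (B, grid*grid, 1)


if __name__ == "__main__":
    img_size, patch_size = 56, 14  # DINOv2 uses 14x14 patches
    mask = torch.zeros(2, img_size, img_size)
    mask[:, :28, :28] = 1  # top-left quadrant is foreground
    tokens = mask_to_patch_tokens(mask, img_size, patch_size)
    print(tokens.shape)  # torch.Size([2, 16, 1])
    print(tokens[0, :, 0].view(4, 4))  # 2x2 block of ones in the top-left
```

The flattened order is row-major over the patch grid, which lines up with patch tokens produced by flattening a convolutional patch embedding. Two points would still need checking against the real model: the image size should be a multiple of the patch size, and if a class token is prepended to the patch tokens, the mask has one entry fewer than the token sequence.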

Nuyoah13 commented 1 month ago

Hi, did it run as expected with the DINOv2 encoder?

mioadxll commented 1 month ago

Yes, but as I asked earlier, I am not sure whether this is consistent with the original author's configuration. In my reproduction, I did not find any significant difference in performance compared with using ResNet101.