Open yanbai1993 opened 3 months ago
@tsb0601 ??
@yanbai1993
I think the code is set up for image size 224, so the number of patches for both CLIP and DINO is 256 (224 / 14 = 16 per side, 16 × 16 = 256), which sums to 512, as shown in the table above.
The table below shows results for image size 336. Here the number of patches for both CLIP and DINO is 576 (336 / 14 = 24 per side, 24 × 24 = 576), which sums to 1152.
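For anyone checking the numbers, here is a minimal sketch of the arithmetic. It assumes square inputs, a patch size of 14 for both encoders, and that the CLS token is not counted. `num_patches` is a hypothetical helper written for this comment, not a function from the repo.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image, excluding any CLS token.

    Hypothetical helper for illustration only; assumes the image side is an
    exact multiple of the patch size.
    """
    if image_size % patch_size != 0:
        raise ValueError(
            f"image_size {image_size} is not divisible by patch_size {patch_size}"
        )
    side = image_size // patch_size  # patches along one edge
    return side * side


for size in (224, 336):
    per_encoder = num_patches(size)
    # Two encoders (CLIP + DINO) with the same resolution and patch size
    print(size, per_encoder, 2 * per_encoder)
# prints:
# 224 256 512
# 336 576 1152
```

So a value of 256 corresponds to a 224 input and 576 to a 336 input, for either encoder.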
@tsb0601 Is clip-vit-large-patch14 used in LLaVA-1.5 with image size 224?
Hi! Why is the dinov2 num_patches set to 256? The image size is 336 and the kernel size is 14, so the number of patches should be the same as for clip, which is 576.