For config denseclip_fpn_vit-b_640x640_80k.py:

In text_encoder, embed_dim=512, while ViT-B-16.pt has embed_dim=1024. When loading the weights, it fails with:

"RuntimeError: Error(s) in loading state_dict for CLIPTextContextEncoder:
size mismatch for text_projection: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512])"

How do you deal with this problem?
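For reference, here is a minimal sketch for confirming which embed_dim the checkpoint actually carries, assuming ViT-B-16.pt is the OpenAI CLIP release (which ships as a TorchScript archive); if your file is a plain state_dict, the torch.load fallback applies:

```python
import torch

# Assumption: ViT-B-16.pt is the OpenAI CLIP TorchScript archive;
# fall back to a plain state_dict load if torch.jit.load rejects it.
try:
    state_dict = torch.jit.load("ViT-B-16.pt", map_location="cpu").state_dict()
except RuntimeError:
    state_dict = torch.load("ViT-B-16.pt", map_location="cpu")

# In CLIP, text_projection has shape [transformer_width, embed_dim],
# so the second dimension is the embed_dim the config must match.
print(state_dict["text_projection"].shape)
```

If the printed shape disagrees with the config, the embed_dim in the text_encoder section would need to match the checkpoint's second dimension to avoid the size-mismatch error above.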