jefferyZhan / Griffon

【ECCV2024】The official repo of Griffon series

position embedding size mismatch #18

Open sunzc-sunny opened 1 day ago

sunzc-sunny commented 1 day ago

Hi, I followed your "Get Started" steps to run the demo code and also completed the "resize clip model" step. However, I encountered the following warning and error:

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:2025: UserWarning: for vision_model.post_layernorm.bias: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass assign=True to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)

RuntimeError: Error(s) in loading state_dict for CLIPVisionModel: size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([257, 1024]) from checkpoint, the shape in current model is torch.Size([1025, 1024]).

Could you please help me resolve this position embedding size mismatch issue? Thanks in advance.

jefferyZhan commented 1 day ago

This error indicates that you are still loading the ViT with an input size of 224. Please check whether you have passed the correct path to the modified ViT and whether the position embedding modification actually succeeded.
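For a quick check (a minimal sketch, assuming the resized checkpoint sits at checkpoints/clip-vit-large-patch14-448_v3, the path used later in this thread), the saved position table of a 448-input, patch-14 ViT should have (448 / 14)^2 + 1 = 1025 rows:

import torch
from transformers import AutoConfig

ckpt_dir = "checkpoints/clip-vit-large-patch14-448_v3"

# The resized weights on disk should already carry the 1025-row table.
state_dict = torch.load(f"{ckpt_dir}/pytorch_model.bin", map_location="cpu")
print(state_dict["vision_model.embeddings.position_embedding.weight"].shape)
# torch.Size([1025, 1024]) if the resize succeeded; [257, 1024] means the 224 weights are still there.

# The config read from the same directory should also describe the 448 input.
cfg = AutoConfig.from_pretrained(ckpt_dir)
vision_cfg = getattr(cfg, "vision_config", cfg)  # full CLIP config vs. vision-only config
print(vision_cfg.image_size, vision_cfg.patch_size)  # expected: 448 14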

sunzc-sunny commented 1 day ago

Thank you for your time. I ran into an odd inconsistency with the checkpoint. When I load the weights directly with state_dict = torch.load("checkpoints/clip-vit-large-patch14-448_v3/pytorch_model.bin"), position_embedding.weight has shape torch.Size([1025, 1024]). However, when I load the model with model = CLIPModel.from_pretrained('checkpoints/clip-vit-large-patch14-448_v3'), the shape becomes torch.Size([257, 1024]). I have verified that the configuration files were copied correctly using the provided tools, so this discrepancy is quite puzzling. I am currently using Transformers version 4.31.0.
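As a general note on the Hugging Face CLIP implementation (not specific to Griffon's resize tool): from_pretrained first builds the module from config.json, and the vision embeddings size their position table from image_size and patch_size; the tensors in pytorch_model.bin are only copied in afterwards. A 257-row table from from_pretrained would therefore be consistent with the loader reading a config (or a cached copy of the model) that still says image_size=224. A small illustration of where 257 and 1025 come from:

from transformers import CLIPVisionConfig

for image_size in (224, 448):
    cfg = CLIPVisionConfig(image_size=image_size, patch_size=14)
    # (image_size / patch_size)^2 patches plus one class-token position
    print(image_size, (cfg.image_size // cfg.patch_size) ** 2 + 1)  # 224 -> 257, 448 -> 1025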

To address this issue, I used a straightforward workaround: loading the parameters twice in clip_encoder.py:

# Build the vision tower from the config; weights whose shapes do not match
# are skipped (and left randomly initialized) instead of raising an error.
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, ignore_mismatched_sizes=True)
# Then overwrite the parameters with the resized 448-input weights.
state_dict = torch.load("checkpoints/clip-vit-large-patch14-448_v3/pytorch_model.bin")
self.vision_tower.load_state_dict(state_dict, strict=False)
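One caveat on this double-load pattern (a general PyTorch/Transformers note, not something confirmed by the repo authors): ignore_mismatched_sizes=True leaves any skipped weight randomly initialized, and strict=False only tolerates missing or unexpected keys, not shape mismatches, so it is worth confirming that the second load really overwrote the position table. A standalone sanity check under the same path assumption as above:

import torch
from transformers import CLIPVisionModel

ckpt_dir = "checkpoints/clip-vit-large-patch14-448_v3"
tower = CLIPVisionModel.from_pretrained(ckpt_dir, ignore_mismatched_sizes=True)
result = tower.load_state_dict(
    torch.load(f"{ckpt_dir}/pytorch_model.bin", map_location="cpu"), strict=False
)
# The table should now match the 448 input; a size-mismatch error here would mean
# the config being read still describes the 224 model.
print(tower.vision_model.embeddings.position_embedding.weight.shape)  # expect torch.Size([1025, 1024])
print(result.missing_keys, result.unexpected_keys)  # keys left untouched / ignored (e.g. text_model.* if present)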

I hope this sheds some light on the issue, and anyone is welcome to dig into the real cause of this error~