I tried to fine-tune the segmentation model using the pretrained Vim-T, but encountered the following issue while executing bash scripts/ft_vim_tiny_upernet.sh:
Position interpolate from 14x14 to 32x32
Traceback (most recent call last):
File "/home/vic1113/miniconda3/envs/vim_seg/lib/python3.9/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 89, in __init__
self.init_weights(pretrained)
File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 143, in init_weights
interpolate_pos_embed(self, state_dict_model)
File "/home/vic1113/PrMamba/vim/utils.py", line 258, in interpolate_pos_embed
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
RuntimeError: shape '[-1, 14, 14, 192]' is invalid for input of size 37824
This error is propagated through multiple functions, resulting in the final error:
RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824.
The pretrained weight I used was vim_t_midclstok_76p1acc.pth, which seems to be the correct one. If it weren't, I would expect a loading error such as "size mismatch for norm_f.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([384])", but I didn't get one.
So, I guess there might be an issue with the model settings, but I'm not sure. Note that 37824 = (14 × 14 + 1) × 192, and the "+1" is the part that causes the error. If the "+1" corresponds to the mid cls token, should I just drop it for the segmentation model?
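If the extra token really is the mid cls token, one plausible workaround is to drop it from pos_embed before the interpolation step. The sketch below is hypothetical, not the repo's interpolate_pos_embed; it assumes the checkpoint's pos_embed has shape (1, 14×14 + 1, 192) and that the extra token sits in the middle of the sequence (which is how "mid cls token" reads, but I haven't verified it):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed_no_cls(pos_embed, orig_size=14, new_size=32):
    """Interpolate a (1, N, C) positional embedding grid from
    orig_size x orig_size to new_size x new_size, dropping one extra
    token (assumed to be the mid cls token) if N == orig_size**2 + 1."""
    embedding_size = pos_embed.shape[-1]
    num_patches = orig_size * orig_size
    if pos_embed.shape[1] == num_patches + 1:
        # Assumption: the extra token is the mid cls token, inserted at
        # the middle of the sequence -- remove it before reshaping.
        mid = pos_embed.shape[1] // 2
        pos_embed = torch.cat([pos_embed[:, :mid], pos_embed[:, mid + 1:]], dim=1)
    # Now the reshape that previously failed succeeds: N == orig_size**2.
    tokens = pos_embed.reshape(-1, orig_size, orig_size, embedding_size)
    tokens = tokens.permute(0, 3, 1, 2)  # (1, C, H, W) for interpolate
    tokens = F.interpolate(tokens, size=(new_size, new_size),
                           mode='bicubic', align_corners=False)
    return tokens.permute(0, 2, 3, 1).reshape(1, new_size * new_size, embedding_size)
```

With a (1, 197, 192) input this returns a (1, 1024, 192) embedding, matching the 32x32 grid the script reports interpolating to.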
Has anyone encountered this problem, or successfully fine-tuned a segmentation model? Thank you very much!
No, I can't apply the pretrained weights to the segmentation model.
It seems the shapes of the backbones are different, and we might need to retrain it.
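To see exactly where the backbone shapes diverge, a small helper like the one below can list every mismatched parameter between the checkpoint and the segmentation model (this is a hypothetical diagnostic, not part of the repo):

```python
import torch

def report_shape_mismatches(ckpt_state, model):
    """Print every parameter name whose shape differs between a checkpoint
    state dict and a model's state dict -- a quick way to check whether a
    pretrained backbone actually matches the current architecture."""
    model_state = model.state_dict()
    for name, param in ckpt_state.items():
        if name not in model_state:
            print(f'{name}: missing from model')
        elif model_state[name].shape != param.shape:
            print(f'{name}: checkpoint {tuple(param.shape)} '
                  f'vs model {tuple(model_state[name].shape)}')
```

Running this on the loaded vim_t_midclstok_76p1acc.pth state dict against the VisionMambaSeg backbone should show whether the disagreement is only in pos_embed or spread across more layers.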