I tried to fine-tune the segmentation model using the pretrained Vim-T, but encountered the following issue while executing bash scripts/ft_vim_tiny_upernet.sh:
Position interpolate from 14x14 to 32x32
Traceback (most recent call last):
File "/home/vic1113/miniconda3/envs/vim_seg/lib/python3.9/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 89, in __init__
self.init_weights(pretrained)
File "/home/vic1113/PrMamba/seg/backbone/vim.py", line 143, in init_weights
interpolate_pos_embed(self, state_dict_model)
File "/home/vic1113/PrMamba/vim/utils.py", line 258, in interpolate_pos_embed
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
RuntimeError: shape '[-1, 14, 14, 192]' is invalid for input of size 37824
This error is propagated through multiple functions, resulting in the final error:
RuntimeError: EncoderDecoder: VisionMambaSeg: shape '[-1, 14, 14, 192]' is invalid for input of size 37824.
The pretrained weight I used was vim_t_midclstok_76p1acc.pth, which seems to be the correct one. If it weren't, I would expect a loading error such as "size mismatch for norm_f.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([384])", but I didn't get one.
So, I guess there might be an issue with the model settings, but I'm not sure. Note that 37824 = (14 × 14 + 1) × 192, and the "+1" is the part that causes the error. If the "+1" corresponds to the mid cls token, should I just drop it for the segmentation model?
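If the extra token really is the mid cls token, one plausible workaround is to drop it from pos_embed before the interpolation step. The sketch below is hypothetical, not the repo's interpolate_pos_embed; it assumes the checkpoint's pos_embed has shape (1, 14×14 + 1, 192) and that the extra token sits in the middle of the sequence (which is how "mid cls token" reads, but I haven't verified it):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed_no_cls(pos_embed, orig_size=14, new_size=32):
    """Interpolate a (1, N, C) positional embedding grid from
    orig_size x orig_size to new_size x new_size, dropping one extra
    token (assumed to be the mid cls token) if N == orig_size**2 + 1."""
    embedding_size = pos_embed.shape[-1]
    num_patches = orig_size * orig_size
    if pos_embed.shape[1] == num_patches + 1:
        # Assumption: the extra token is the mid cls token, inserted at
        # the middle of the sequence -- remove it before reshaping.
        mid = pos_embed.shape[1] // 2
        pos_embed = torch.cat([pos_embed[:, :mid], pos_embed[:, mid + 1:]], dim=1)
    # Now the reshape that previously failed succeeds: N == orig_size**2.
    tokens = pos_embed.reshape(-1, orig_size, orig_size, embedding_size)
    tokens = tokens.permute(0, 3, 1, 2)  # (1, C, H, W) for interpolate
    tokens = F.interpolate(tokens, size=(new_size, new_size),
                           mode='bicubic', align_corners=False)
    return tokens.permute(0, 2, 3, 1).reshape(1, new_size * new_size, embedding_size)
```

With a (1, 197, 192) input this returns a (1, 1024, 192) embedding, matching the 32x32 grid the script reports interpolating to.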
Has anyone encountered this problem, or successfully fine-tuned a segmentation model? Thank you very much!
No, I can't apply the pretrained weights to the segmentation model.
It seems the shapes of the backbones are different, and we might need to retrain it.
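To see exactly where the backbone shapes diverge, a small helper like the one below can list every mismatched parameter between the checkpoint and the segmentation model (this is a hypothetical diagnostic, not part of the repo):

```python
import torch

def report_shape_mismatches(ckpt_state, model):
    """Print every parameter name whose shape differs between a checkpoint
    state dict and a model's state dict -- a quick way to check whether a
    pretrained backbone actually matches the current architecture."""
    model_state = model.state_dict()
    for name, param in ckpt_state.items():
        if name not in model_state:
            print(f'{name}: missing from model')
        elif model_state[name].shape != param.shape:
            print(f'{name}: checkpoint {tuple(param.shape)} '
                  f'vs model {tuple(model_state[name].shape)}')
```

Running this on the loaded vim_t_midclstok_76p1acc.pth state dict against the VisionMambaSeg backbone should show whether the disagreement is only in pos_embed or spread across more layers.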