mlfoundations / open_clip

An open source implementation of CLIP.

RuntimeError: The shape of the 2D attn_mask is torch.Size([77, 77]), but should be (128, 128). #910

Closed KYKong97 closed 1 month ago

KYKong97 commented 2 months ago

I ran into this error after upgrading open_clip_torch to 2.26.1:

Details

Traceback (most recent call last):
  File "train_net.py", line 340, in <module>
    launch(
  File "/home/fc-clip/detectron2/detectron2/engine/launch.py", line 84, in launch
    main_func(*args)
  File "train_net.py", line 325, in main
    res = Trainer.test(cfg, model)
  File "/home/fc-clip/detectron2/detectron2/engine/defaults.py", line 621, in test
    results_i = inference_on_dataset(model, data_loader, evaluator)
  File "/home/fc-clip/detectron2/detectron2/evaluation/evaluator.py", line 165, in inference_on_dataset
    outputs = model(inputs)
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fc-clip/fcclip/fcclip.py", line 324, in forward
    text_classifier, num_templates = self.get_text_classifier()
  File "/home/fc-clip/fcclip/fcclip.py", line 208, in get_text_classifier
    text_classifier.append(self.backbone.get_text_classifier(self.test_class_names[idx:idx+bs], self.device).detach())
  File "/home/fc-clip/fcclip/modeling/backbone/clip.py", line 211, in get_text_classifier
    text_features = self.encode_text(text_tokens, normalize=False)
  File "/home/fc-clip/fcclip/modeling/backbone/clip.py", line 95, in encode_text
    x = self.clip_model.transformer(x, attn_mask=self.clip_model.attn_mask)
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/open_clip/transformer.py", line 363, in forward
    x = r(x, attn_mask=attn_mask)
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/open_clip/transformer.py", line 263, in forward
    x = q_x + self.ls_1(self.attention(q_x=self.ln_1(q_x), k_x=k_x, v_x=v_x, attn_mask=attn_mask))
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/open_clip/transformer.py", line 250, in attention
    return self.attn(
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1031, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/opt/conda/envs/fcclip2/lib/python3.8/site-packages/torch/nn/functional.py", line 4992, in multi_head_attention_forward
    raise RuntimeError(f"The shape of the 2D attn_mask is {attn_mask.shape}, but should be {correct_2d_size}.")
RuntimeError: The shape of the 2D attn_mask is torch.Size([77, 77]), but should be (128, 128).

rwightman commented 2 months ago

@KYKong97 It looks like fc-clip integrates at a low level and calls the transformer layers directly, so it needs to match the LND vs. NLD ordering that changed here.

liuhengyue commented 2 months ago

You need to set clip_model.transformer.batch_first = False with the new release.
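For anyone hitting the same error from their own integration code, here is a minimal sketch of that workaround. It assumes a recent open_clip_torch (2.26.1 in this thread, where the text transformer takes batch-first input) and downstream code that, like fc-clip, permutes to LND before calling clip_model.transformer directly; the model name and pretrained tag are placeholders, not what fc-clip uses.

```python
import torch
import open_clip

# Any model with a text tower works here; ViT-B-32 / laion2b_s34b_b79k are placeholders.
model, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

# Recent open_clip builds feed the text transformer NLD (batch-first) tensors.
# Integration code that still permutes to LND first must flip this flag back,
# otherwise the [77, 77] attn_mask is checked against the batch dimension and
# the shape error above is raised.
model.transformer.batch_first = False

tokens = tokenizer(["a photo of a cat", "a photo of a dog"])  # [N, 77]
with torch.no_grad():
    x = model.token_embedding(tokens)        # [N, 77, d]
    x = x + model.positional_embedding
    x = x.permute(1, 0, 2)                   # NLD -> LND, as older integrations expect
    x = model.transformer(x, attn_mask=model.attn_mask)
    x = x.permute(1, 0, 2)                   # LND -> NLD
    x = model.ln_final(x)
```

The alternative, if you control the calling code, is to drop the manual permutes and pass batch-first tensors straight through, matching what the current encode_text does.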

KYKong97 commented 1 month ago

Thanks. I managed to run the code now.

JoEarl commented 1 month ago

I appreciate it. Thanks a lot.