salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

transformers 4.27 compatibility #227

Open gunesevitan opened 1 year ago

gunesevitan commented 1 year ago

I have to use transformers 4.27 because the latest version of clip-interrogator requires that specific version. After upgrading transformers from 4.26 to 4.27, I ran into this issue.


╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/src/image_captioning/blip_.py:168   │
│ in <module>                                                                                      │
│                                                                                                  │
│   165 │   for step, inputs in enumerate(progress_bar):                                           │
│   166 │   │                                                                                      │
│   167 │   │   inputs = inputs.to(device)                                                         │
│ ❱ 168 │   │   batch_predictions = predict_blip(                                                  │
│   169 │   │   │   inputs=inputs,                                                                 │
│   170 │   │   │   model=blip_model,                                                              │
│   171 │   │   │   nucleus_sampling=False,                                                        │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/src/image_captioning/blip_.py:92 in │
│ predict_blip                                                                                     │
│                                                                                                  │
│    89 │   """                                                                                    │
│    90 │                                                                                          │
│    91 │   with torch.no_grad(), torch.autocast(device_type=device.type, dtype=torch.float16):    │
│ ❱  92 │   │   outputs = model.generate(                                                          │
│    93 │   │   │   samples={'image': inputs},                                                     │
│    94 │   │   │   use_nucleus_sampling=nucleus_sampling,                                         │
│    95 │   │   │   num_beams=num_beams,                                                           │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/blip_models/blip_caption.py:188 in generate                               │
│                                                                                                  │
│   185 │   │   prompt.input_ids = prompt.input_ids[:, :-1]                                        │
│   186 │   │                                                                                      │
│   187 │   │   # get decoded text                                                                 │
│ ❱ 188 │   │   decoder_out = self.text_decoder.generate_from_encoder(                             │
│   189 │   │   │   tokenized_prompt=prompt,                                                       │
│   190 │   │   │   visual_embeds=image_embeds,                                                    │
│   191 │   │   │   sep_token_id=self.tokenizer.sep_token_id,                                      │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:1363 in generate_from_encoder                                      │
│                                                                                                  │
│   1360 │   │   │   )                                                                             │
│   1361 │   │   else:                                                                             │
│   1362 │   │   │   # beam search                                                                 │
│ ❱ 1363 │   │   │   outputs = self.generate(                                                      │
│   1364 │   │   │   │   input_ids=tokenized_prompt.input_ids,                                     │
│   1365 │   │   │   │   max_length=max_length,                                                    │
│   1366 │   │   │   │   min_length=min_length,                                                    │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/autograd/grad_mode.py:27 in decorate_context                                     │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/transformers/generation/utils.py:1490 in generate                                      │
│                                                                                                  │
│   1487 │   │   │   │   **model_kwargs,                                                           │
│   1488 │   │   │   )                                                                             │
│   1489 │   │   │   # 13. run beam search                                                         │
│ ❱ 1490 │   │   │   return self.beam_search(                                                      │
│   1491 │   │   │   │   input_ids,                                                                │
│   1492 │   │   │   │   beam_scorer,                                                              │
│   1493 │   │   │   │   logits_processor=logits_processor,                                        │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/transformers/generation/utils.py:2749 in beam_search                                   │
│                                                                                                  │
│   2746 │   │   │                                                                                 │
│   2747 │   │   │   model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)  │
│   2748 │   │   │                                                                                 │
│ ❱ 2749 │   │   │   outputs = self(                                                               │
│   2750 │   │   │   │   **model_inputs,                                                           │
│   2751 │   │   │   │   return_dict=True,                                                         │
│   2752 │   │   │   │   output_attentions=output_attentions,                                      │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:1213 in forward                                                    │
│                                                                                                  │
│   1210 │   │   if labels is not None:                                                            │
│   1211 │   │   │   use_cache = False                                                             │
│   1212 │   │                                                                                     │
│ ❱ 1213 │   │   outputs = self.bert(                                                              │
│   1214 │   │   │   input_ids,                                                                    │
│   1215 │   │   │   attention_mask=attention_mask,                                                │
│   1216 │   │   │   position_ids=position_ids,                                                    │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:977 in forward                                                     │
│                                                                                                  │
│    974 │   │   else:                                                                             │
│    975 │   │   │   embedding_output = encoder_embeds                                             │
│    976 │   │                                                                                     │
│ ❱  977 │   │   encoder_outputs = self.encoder(                                                   │
│    978 │   │   │   embedding_output,                                                             │
│    979 │   │   │   attention_mask=extended_attention_mask,                                       │
│    980 │   │   │   head_mask=head_mask,                                                          │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:595 in forward                                                     │
│                                                                                                  │
│    592 │   │   │   │   │   mode=mode,                                                            │
│    593 │   │   │   │   )                                                                         │
│    594 │   │   │   else:                                                                         │
│ ❱  595 │   │   │   │   layer_outputs = layer_module(                                             │
│    596 │   │   │   │   │   hidden_states,                                                        │
│    597 │   │   │   │   │   attention_mask,                                                       │
│    598 │   │   │   │   │   layer_head_mask,                                                      │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:478 in forward                                                     │
│                                                                                                  │
│    475 │   │   │   │   outputs = outputs + cross_attention_outputs[1:-1]                         │
│    476 │   │   │                                                                                 │
│    477 │   │   │   else:                                                                         │
│ ❱  478 │   │   │   │   cross_attention_outputs = self.crossattention(                            │
│    479 │   │   │   │   │   attention_output,                                                     │
│    480 │   │   │   │   │   attention_mask,                                                       │
│    481 │   │   │   │   │   head_mask,                                                            │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:349 in forward                                                     │
│                                                                                                  │
│    346 │   │   past_key_value=None,                                                              │
│    347 │   │   output_attentions=False,                                                          │
│    348 │   ):                                                                                    │
│ ❱  349 │   │   self_outputs = self.self(                                                         │
│    350 │   │   │   hidden_states,                                                                │
│    351 │   │   │   attention_mask,                                                               │
│    352 │   │   │   head_mask,                                                                    │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:222 in forward                                                     │
│                                                                                                  │
│    219 │   │   print('query', query_layer.shape)                                                 │
│    220 │   │   print('key', key_layer.shape)                                                     │
│    221 │   │   print('key t', key_layer.transpose(-1, -2).shape)                                 │
│ ❱  222 │   │   attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))         │
│    223 │   │                                                                                     │
│    224 │   │   if (                                                                              │
│    225 │   │   │   self.position_embedding_type == "relative_key"                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (48) must match the size of tensor b (144) at non-singleton dimension 0

I'm not sure whether the first dimension of 144 is correct here. What changed in transformers 4.27 to cause this?
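
For what it's worth, here is a minimal standalone repro of the shape clash. Only the 48 and 144 come from the traceback; every other size is a placeholder. The cross-attention in med.py computes Q @ K^T, so the query and key tensors have to agree on the leading batch dimension, and here they don't. 144 is exactly 3 × 48, which would be consistent with the visual embeddings getting expanded by the beam count a second time somewhere in the 4.27 generate path, but that is only a guess on my part.

import torch

# Hypothetical repro (only 48 and 144 are taken from the traceback; the
# remaining dimensions are placeholders). torch.matmul broadcasts the batch
# dimensions of query and key, so 48 vs 144 raises the same RuntimeError.
num_heads, head_dim = 2, 4
query_layer = torch.randn(48, num_heads, 8, head_dim)    # text-side queries
key_layer = torch.randn(144, num_heads, 16, head_dim)    # visual-side keys, 3x larger batch

try:
    torch.matmul(query_layer, key_layer.transpose(-1, -2))
except RuntimeError as e:
    print(e)  # "The size of tensor a (48) must match the size of tensor b (144) ..."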

HuangChiEn commented 1 year ago


Yes, I asked the same question yesterday; we need to downgrade the transformers version. You can see that requirements.txt constrains the transformers package to transformers>=4.25.0,<4.27, so it has to be below 4.27!

At least 4.25 works (that's the version I'm using).
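
A quick way to catch this before it blows up mid-run (just a sketch; the only hard fact here is the requirements.txt range) is to assert the installed version at startup:

from packaging import version
import transformers

# Minimal sketch: fail fast if the installed transformers version is outside
# the range pinned in LAVIS's requirements.txt (transformers>=4.25.0,<4.27).
installed = version.parse(transformers.__version__)
if not (version.parse("4.25.0") <= installed < version.parse("4.27")):
    raise RuntimeError(
        f"transformers {transformers.__version__} is outside the range LAVIS pins; "
        "downgrade, e.g. `pip install 'transformers>=4.25.0,<4.27'`"
    )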

gunesevitan commented 1 year ago

│    481 │   │   │   │   │   head_mask,                                                            │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:349 in forward                                                     │
│                                                                                                  │
│    346 │   │   past_key_value=None,                                                              │
│    347 │   │   output_attentions=False,                                                          │
│    348 │   ):                                                                                    │
│ ❱  349 │   │   self_outputs = self.self(                                                         │
│    350 │   │   │   hidden_states,                                                                │
│    351 │   │   │   attention_mask,                                                               │
│    352 │   │   │   head_mask,                                                                    │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/torch/nn/modules/module.py:1194 in _call_impl                                          │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/gunes/Desktop/Kaggle/stable-diffusion-image-to-prompts/venv_competition/lib/python3.9/site │
│ -packages/lavis/models/med.py:222 in forward                                                     │
│                                                                                                  │
│    219 │   │   print('query', query_layer.shape)                                                 │
│    220 │   │   print('key', key_layer.shape)                                                     │
│    221 │   │   print('key t', key_layer.transpose(-1, -2).shape)                                 │
│ ❱  222 │   │   attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))         │
│    223 │   │                                                                                     │
│    224 │   │   if (                                                                              │
│    225 │   │   │   self.position_embedding_type == "relative_key"                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (48) must match the size of tensor b (144) at non-singleton dimension 0

I'm not sure whether the first dimension of 144 is correct here. What change in transformers 4.27 is causing this?
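
For reference, here is a minimal standalone sketch of how this kind of mismatch can arise (the shapes below are assumptions based on the printed sizes, not LAVIS code): if the encoder-side keys are tiled once per beam (48 × 3 = 144) while the decoder-side queries still have the original batch size of 48, the batched matmul in the attention layer cannot broadcast the batch dimension and raises exactly this error.

```python
import torch

# Hypothetical shapes: 48 images, 3 beams, 12 heads, 64-dim heads,
# 5 decoded tokens so far, 577 visual tokens.
batch, num_beams, heads, head_dim, txt_len, img_len = 48, 3, 12, 64, 5, 577

# Decoder-side queries built from a batch of 48 sequences.
query_layer = torch.randn(batch, heads, txt_len, head_dim)

# Encoder-side keys tiled once per beam -> batch dimension 48 * 3 = 144.
key_layer = torch.randn(batch * num_beams, heads, img_len, head_dim)

# Same operation as med.py line 222; the batch dimensions (48 vs 144)
# cannot broadcast, so torch raises the RuntimeError shown above.
try:
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
except RuntimeError as e:
    print(e)
```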

Yes, I just asked the same question yesterday; we need to downgrade the transformers version. You can see that requirements.txt constrains the transformers package to transformers>=4.25.0,<4.27, so it has to be lower than 4.27!

At least 4.25 works (I use that version).
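
If it helps, a quick guard like this at the top of a script (plain Python; the bound simply mirrors the requirements.txt pin mentioned above) fails fast when an incompatible transformers version is installed:

```python
import transformers

# LAVIS's requirements pin transformers>=4.25.0,<4.27 for the BLIP models.
major, minor = (int(part) for part in transformers.__version__.split(".")[:2])
if not ((4, 25) <= (major, minor) < (4, 27)):
    raise RuntimeError(
        f"transformers {transformers.__version__} is installed, "
        "but BLIP in LAVIS currently needs transformers>=4.25.0,<4.27"
    )
```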

Yeah, I figured that out, but I have to use transformers 4.27 :/

LiJunnan1992 commented 1 year ago

We have made an update to the BLIP-2 OPT models so that they work with the latest transformers (version>=4.27).

gunesevitan commented 1 year ago

We have made an update to the BLIP-2 OPT models so that they work with the latest transformers (version>=4.27).

Does the BLIP model work with transformers>=4.27 too?

LiJunnan1992 commented 1 year ago

BLIP model does not work with transformers>=4.27.

Alchemistyui commented 1 year ago

BLIP model does not work with transformers>=4.27.

May I know why BLIP doesn't work with transformers>=4.27? I have to use transformers>4.27; would it be possible for me to modify transformers>4.27 locally so that it fits the BLIP model? Thank you in advance.

LiJunnan1992 commented 1 year ago

@Alchemistyui You may refer to this change and this change, which affect the BLIP model's generate function.
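
For anyone who still experiments with transformers>=4.27, here is a purely illustrative sketch of the kind of adjustment involved — not the maintainers' fix and independent of the linked changes: tiling the image/encoder hidden states and their attention mask by num_beams so that their batch dimension matches the beam-expanded decoder inputs. The helper name and shapes are hypothetical.

```python
import torch

def tile_encoder_states_for_beams(encoder_hidden_states: torch.Tensor,
                                  encoder_attention_mask: torch.Tensor,
                                  num_beams: int):
    # Hypothetical helper: repeat each sample's encoder states once per beam
    # so the batch dimension becomes batch_size * num_beams, matching the
    # decoder inputs that generate() expands for beam search.
    return (
        encoder_hidden_states.repeat_interleave(num_beams, dim=0),
        encoder_attention_mask.repeat_interleave(num_beams, dim=0),
    )

# Example: 48 images with 577 visual tokens of width 768, 3 beams.
image_embeds = torch.randn(48, 577, 768)
image_atts = torch.ones(48, 577, dtype=torch.long)
image_embeds, image_atts = tile_encoder_states_for_beams(image_embeds, image_atts, num_beams=3)
print(image_embeds.shape)  # torch.Size([144, 577, 768])
```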