shikras / d-cube

A detection/segmentation dataset with labels characterized by intricate and flexible expressions. "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023).
https://arxiv.org/abs/2307.12813

OWL-ViT pre-trained models cannot accept some of the longest descriptions #13

Open HarukiNishimura-TRI opened 7 months ago

HarukiNishimura-TRI commented 7 months ago

Dear authors,

Thank you for your work and the release of the d-cube dataset.

I was trying to run a pre-trained OWL-ViT model (e.g., "google/owlvit-base-patch32") on the dataset and found that the following descriptions yield a RuntimeError.

 ID: 140, TEXT: "a person who wears a hat and holds a tennis racket on the tennis court",
 ID: 146, TEXT: "the player who is ready to bat with both feet leaving the ground in the room",
 ID: 253, TEXT: "a person who plays music with musical instrument surrounded by spectators on the street",
 ID: 342, TEXT: "a fisher who stands on the shore and whose lower body is not submerged by water",
 ID: 348, TEXT: "a person who stands on the stage for speech but don't open their mouths",
 ID: 355, TEXT: "a person with a pen in one hand but not looking at the paper",
 ID: 356, TEXT: "a billiard ball with no numbers or patterns on its surface on the table",
 ID: 364, TEXT: "a person standing at the table of table tennis who is not waving table tennis rackets",
 ID: 404, TEXT: "a water polo player who is in the water but does not hold the ball",
 ID: 405, TEXT: "a barbell held by a weightlifter that has not been lifted above the head",
 ID: 412, TEXT: "a person who wears a helmet and sling equipment but is not on the sling",
 ID: 419, TEXT: "person who kneels on one knee and proposes but has nothing in his hand"

A typical error message is shown at the bottom. It seems that the pre-trained model uses max_position_embeddings = 16 in OwlViTTextConfig, which is not long enough to accept the descriptions above as input. All the OWL-ViT checkpoints available on Hugging Face seem to use max_position_embeddings = 16. Did you encounter the same issue when running the experiments for the paper? If so, how did you handle it during evaluation?

Thanks in advance.
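
For reference, a minimal sketch (not part of the d-cube release; the descriptions dict below is an illustrative stand-in for the IDs listed above) of how one can check which descriptions exceed the text encoder's position limit:

from transformers import OwlViTConfig, OwlViTProcessor

ckpt = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(ckpt)

# The pre-trained text encoder only has 16 learned position embeddings.
max_len = OwlViTConfig.from_pretrained(ckpt).text_config.max_position_embeddings  # 16

# Hypothetical {id: text} mapping; in practice these come from the d-cube annotations.
descriptions = {
    140: "a person who wears a hat and holds a tennis racket on the tennis court",
    404: "a water polo player who is in the water but does not hold the ball",
}

for desc_id, text in descriptions.items():
    # Tokenize without padding or truncation to see the true length (BOS/EOS included).
    n_tokens = len(processor.tokenizer(text)["input_ids"])
    if n_tokens > max_len:
        print(f"ID {desc_id}: {n_tokens} tokens exceeds the limit of {max_len}")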

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[129], line 1
----> 1 results = get_prediction(processor, model, image, [text_list[0]])

Cell In[11], line 13, in get_prediction(processor, model, image, captions, cpu_only)
      9 with torch.no_grad():
     10     inputs = processor(text=[captions], images=image, return_tensors="pt").to(
     11         device
     12     )
---> 13     outputs = model(**inputs)
     14 target_size = torch.Tensor([image.size[::-1]]).to(device)
     15 results = processor.post_process_object_detection(
     16     outputs=outputs, target_sizes=target_size, threshold=0.05
     17 )

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1640, in OwlViTForObjectDetection.forward(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states, return_dict)
   1637 return_dict = return_dict if return_dict is not None else self.config.return_dict
   1639 # Embed images and text queries
-> 1640 query_embeds, feature_map, outputs = self.image_text_embedder(
   1641     input_ids=input_ids,
   1642     pixel_values=pixel_values,
   1643     attention_mask=attention_mask,
   1644     output_attentions=output_attentions,
   1645     output_hidden_states=output_hidden_states,
   1646 )
   1648 # Text and vision model outputs
   1649 text_outputs = outputs.text_model_output

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1385, in OwlViTForObjectDetection.image_text_embedder(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states)
   1376 def image_text_embedder(
   1377     self,
   1378     input_ids: torch.Tensor,
   (...)
   1383 ) -> Tuple[torch.FloatTensor]:
   1384     # Encode text and image
-> 1385     outputs = self.owlvit(
   1386         pixel_values=pixel_values,
   1387         input_ids=input_ids,
   1388         attention_mask=attention_mask,
   1389         output_attentions=output_attentions,
   1390         output_hidden_states=output_hidden_states,
   1391         return_dict=True,
   1392     )
   1394     # Get image embeddings
   1395     last_hidden_state = outputs.vision_model_output[0]

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1163, in OwlViTModel.forward(self, input_ids, pixel_values, attention_mask, return_loss, output_attentions, output_hidden_states, return_base_image_embeds, return_dict)
   1155 vision_outputs = self.vision_model(
   1156     pixel_values=pixel_values,
   1157     output_attentions=output_attentions,
   1158     output_hidden_states=output_hidden_states,
   1159     return_dict=return_dict,
   1160 )
   1162 # Get embeddings for all text queries in all batch samples
-> 1163 text_outputs = self.text_model(
   1164     input_ids=input_ids,
   1165     attention_mask=attention_mask,
   1166     output_attentions=output_attentions,
   1167     output_hidden_states=output_hidden_states,
   1168     return_dict=return_dict,
   1169 )
   1171 text_embeds = text_outputs[1]
   1172 text_embeds = self.text_projection(text_embeds)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:798, in OwlViTTextTransformer.forward(self, input_ids, attention_mask, position_ids, output_attentions, output_hidden_states, return_dict)
    796 input_shape = input_ids.size()
    797 input_ids = input_ids.view(-1, input_shape[-1])
--> 798 hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
    800 # num_samples, seq_len = input_shape  where num_samples = batch_size * num_max_text_queries
    801 # OWLVIT's text model uses causal mask, prepare it here.
    802 # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
    803 causal_attention_mask = _create_4d_causal_attention_mask(
    804     input_shape, hidden_states.dtype, device=hidden_states.device
    805 )

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:332, in OwlViTTextEmbeddings.forward(self, input_ids, position_ids, inputs_embeds)
    329     inputs_embeds = self.token_embedding(input_ids)
    331 position_embeddings = self.position_embedding(position_ids)
--> 332 embeddings = inputs_embeds + position_embeddings
    334 return embeddings

RuntimeError: The size of tensor a (18) must match the size of tensor b (16) at non-singleton dimension 1
Charles-Xie commented 7 months ago

Hi Haruki,

Thanks for your interest. Regarding your question: for OWL-ViT, we did skip these sentences with a try-except block during evaluation. The other methods we evaluated do not have this constraint on input length, so no such handling is required for them. I think simply truncating the input to 16 tokens might be a better solution, and we will give it a try. If you have further questions, please feel free to send me an email.

Best regards, Chi
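
For readers hitting the same error, a rough sketch of the skip-on-error workaround described above, reusing the descriptions mapping from the earlier sketch and the processor, model, image, and get_prediction names from the traceback (this is not the authors' actual evaluation code):

# Skip descriptions that the pre-trained OWL-ViT text encoder cannot accept,
# and keep track of which IDs were dropped from the evaluation.
results_by_id = {}
skipped_ids = []

for desc_id, text in descriptions.items():
    try:
        results_by_id[desc_id] = get_prediction(processor, model, image, [text])
    except RuntimeError:
        # Raised when the tokenized description is longer than the 16 positions
        # supported by the pre-trained text encoder.
        skipped_ids.append(desc_id)

print(f"Skipped {len(skipped_ids)} over-length descriptions: {skipped_ids}")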

HarukiNishimura-TRI commented 7 months ago

Hi Chi,

Thank you for the clarification. So you omitted those sentences for the inter-scenario case as well?

Regards, Haruki

Charles-Xie commented 7 months ago

@HarukiNishimura-TRI Yes, I think so, for OWL-ViT. For inference with OWL-ViT, I think it would be better to truncate the descriptions to 16 tokens and run inference on the truncated text.
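
For completeness, a hedged sketch of that truncation alternative, applying truncation=True at the tokenizer level so descriptions longer than 16 tokens are clipped instead of crashing the forward pass (a variant of the get_prediction helper in the traceback; note that the trailing words of a clipped description are lost, which may change its meaning):

import torch

def get_prediction_truncated(processor, model, image, captions, device="cpu"):
    # Variant of the get_prediction helper above that truncates over-long text.
    with torch.no_grad():
        # Clip descriptions to the 16 token positions of the pre-trained text encoder.
        text_inputs = processor.tokenizer(
            captions,
            padding="max_length",
            truncation=True,
            max_length=16,
            return_tensors="pt",
        ).to(device)
        image_inputs = processor.image_processor(image, return_tensors="pt").to(device)
        outputs = model(
            input_ids=text_inputs["input_ids"],
            attention_mask=text_inputs["attention_mask"],
            pixel_values=image_inputs["pixel_values"],
        )
    target_size = torch.Tensor([image.size[::-1]]).to(device)
    return processor.post_process_object_detection(
        outputs=outputs, target_sizes=target_size, threshold=0.05
    )

Whether skipping or truncating is preferable depends on how much of a description's meaning survives after clipping to 16 tokens.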