huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

IndexError: index out of range in self during ViltForImagesAndTextClassification fine-tuning #21233

Closed shantanu778 closed 1 year ago

shantanu778 commented 1 year ago

System Info

I am running on Google Colab. I got the same error on GPU; here I am showing the run without GPU information.

Who can help?

No response

Information

Tasks

Reproduction

Datasets

| text | images |
| --- | --- |
| moorhen swamphen | [image1.jpg, image2.jpg, image3.jpg, image4.jpg, image5.jpg, image6.jpg, image7.jpg, image8.jpg, image9.jpg, image10.jpg] |

According to the dataset, I have to pass 1 text with 10 images, so my input shapes are:

pixel_values: torch.Size([6, 10, 3, 384, 384])
pixel_mask: torch.Size([6, 10, 384, 384])
input_ids: torch.Size([6, 9])

According to the forward function of ViltForImagesAndTextClassification, I should be able to pass num_images images when calling the model.
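
For reference, here is roughly how the model is built and called (a minimal sketch with hypothetical random tensors of the shapes above, using the dandelin/vilt-b32-finetuned-coco checkpoint I am fine-tuning from):

import torch
from transformers import ViltConfig, ViltForImagesAndTextClassification

config = ViltConfig.from_pretrained("dandelin/vilt-b32-finetuned-coco", num_images=10, num_labels=10)
model = ViltForImagesAndTextClassification(config)  # randomly initialized is enough to reproduce

batch_size, num_images = 6, 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, 9))             # hypothetical token ids
pixel_values = torch.randn(batch_size, num_images, 3, 384, 384)              # hypothetical images
pixel_mask = torch.ones(batch_size, num_images, 384, 384, dtype=torch.long)

# Should raise "IndexError: index out of range in self": the token type embedding
# table only has config.modality_type_vocab_size rows (2 by default), while the
# model indexes it with image_token_type_idx = i + 1 for each of the 10 images.
outputs = model(input_ids=input_ids, pixel_values=pixel_values, pixel_mask=pixel_mask)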

But during training, the model raises the following error:

IndexError                                Traceback (most recent call last)

<ipython-input-27-191138835385> in <module>
     70         # encoding = base_processor(images, batch[1], return_tensors="pt")
     71 
---> 72         outputs = model(input_ids=batch['input_ids'], pixel_values=batch['pixel_values'], labels=batch['labels'])
     73 
     74         # print(outputs)

8 frames

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

<ipython-input-23-da8fb21f3dcd> in forward(self, input_ids, attention_mask, token_type_ids, pixel_values, pixel_mask, head_mask, inputs_embeds, image_embeds, labels, output_attentions, output_hidden_states, return_dict)
     64 
     65           # forward every image through the model
---> 66           outputs = self.vilt(
     67               input_ids,
     68               attention_mask=attention_mask,

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.8/dist-packages/transformers/models/vilt/modeling_vilt.py in forward(self, input_ids, attention_mask, token_type_ids, pixel_values, pixel_mask, head_mask, inputs_embeds, image_embeds, image_token_type_idx, output_attentions, output_hidden_states, return_dict)
    836         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
    837 
--> 838         embedding_output, attention_mask = self.embeddings(
    839             input_ids,
    840             attention_mask,

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.8/dist-packages/transformers/models/vilt/modeling_vilt.py in forward(self, input_ids, attention_mask, token_type_ids, pixel_values, pixel_mask, inputs_embeds, image_embeds, image_token_type_idx)
    231             torch.zeros_like(attention_mask, dtype=torch.long, device=text_embeds.device)
    232         )
--> 233         image_embeds = image_embeds + self.token_type_embeddings(
    234             torch.full_like(image_masks, image_token_type_idx, dtype=torch.long, device=text_embeds.device)
    235         )

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    158 
    159     def forward(self, input: Tensor) -> Tensor:
--> 160         return F.embedding(
    161             input, self.weight, self.padding_idx, self.max_norm,
    162             self.norm_type, self.scale_grad_by_freq, self.sparse)

/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2208         # remove once script supports set_grad_enabled
   2209         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2211 
   2212 

IndexError: index out of range in self

But when I change image_token_type_idx=i + 1 to image_token_type_idx=1 in the forward function, where each image is passed through the ViLT model in the following snippet, it works fine.

for i in range(num_images):
            # forward every image through the model
            outputs = self.vilt(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                pixel_values=pixel_values[:, i, :, :, :] if pixel_values is not None else None,
                pixel_mask=pixel_mask[:, i, :, :] if pixel_mask is not None else None,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                image_embeds=image_embeds[:, i, :, :] if image_embeds is not None else None,
                image_token_type_idx=i + 1,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
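
The index that overflows appears to come from the token type embedding table, whose size is config.modality_type_vocab_size (a quick check; attribute paths follow modeling_vilt.py and assume model is the ViltForImagesAndTextClassification instance from my training loop):

# Hypothetical diagnostic: compare the embedding table size with the indices used above.
print(model.config.modality_type_vocab_size)                       # 2 by default
print(model.vilt.embeddings.token_type_embeddings.num_embeddings)  # same value
# image_token_type_idx runs over 1 .. num_images (10 here), which exceeds the table
# size, whereas image_token_type_idx = 1 always stays in range.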

Expected behavior

According to the documentation, there should not be any problem.

sgugger commented 1 year ago

cc @NielsRogge and @alaradirik

alaradirik commented 1 year ago

Hi @shantanu778, your input shapes seem correct but could you provide a minimal code example that reproduces the error?

shantanu778 commented 1 year ago

As you can see, the error is in the forward function. I actually didn't change much in the ViltForImagesAndTextClassification class. Here is my CustomModel:

import torch
from torch import nn
from transformers import PreTrainedModel, ViltModel
from transformers.models.vilt.modeling_vilt import ViltForImagesAndTextClassificationOutput


class CustomModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        # print(config)
        self.num_labels = config.num_labels
        self.vilt = ViltModel(config)

        # Classifier head
        num_images = config.num_images
        self.classifier = nn.Linear(config.hidden_size * num_images, config.num_labels)

    def forward(
        self,
        input_ids = None,
        attention_mask = None,
        token_type_ids = None,
        pixel_values = None,
        pixel_mask = None,
        head_mask = None,
        inputs_embeds = None,
        image_embeds = None,
        labels = None,
        output_attentions = None,
        output_hidden_states = None,
        return_dict = None,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        # print(input_ids)
        # print(pixel_values.size())
        if pixel_values is not None and pixel_values.ndim == 4:
            # add dummy num_images dimension
            pixel_values = pixel_values.unsqueeze(1)

        if image_embeds is not None and image_embeds.ndim == 3:
            # add dummy num_images dimension
            image_embeds = image_embeds.unsqueeze(1)

        num_images = pixel_values.shape[1] if pixel_values is not None else None
        # print(num_images)
        if num_images is None:
            num_images = image_embeds.shape[1] if image_embeds is not None else None
        if num_images != self.config.num_images:
            raise ValueError(
                "Make sure to match the number of images in the model with the number of images in the input."
            )
        pooler_outputs = []
        hidden_states = [] if output_hidden_states else None
        attentions = [] if output_attentions else None
        for i in range(num_images):
          # print(i)
          # print(input_ids)
          # print(pixel_values[:, i, :, :, :])

          # forward every image through the model
          outputs = self.vilt(
              input_ids,
              attention_mask=attention_mask,
              token_type_ids=token_type_ids,
              pixel_values=pixel_values[:, i, :, :, :] if pixel_values is not None else None,
              pixel_mask=pixel_mask[:, i, :, :] if pixel_mask is not None else None,
              head_mask=head_mask,
              inputs_embeds=inputs_embeds,
              image_embeds=image_embeds[:, i, :, :] if image_embeds is not None else None,
              image_token_type_idx=i+1,
              output_attentions=output_attentions,
              output_hidden_states=output_hidden_states,
              return_dict=return_dict,
          )
          # print("="*20)
          # print(outputs)
          pooler_output = outputs.pooler_output if return_dict else outputs[1]
          # print("="*20)
          # print(pooler_output)
          pooler_outputs.append(pooler_output)
          if output_hidden_states:
              hidden_states.append(outputs.hidden_states)
          if output_attentions:
              attentions.append(outputs.attentions)

        pooled_output = torch.cat(pooler_outputs, dim=-1)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            # print(labels)
            loss = loss_fct(logits.view(-1, self.num_labels), labels)

        if not return_dict:
            output = (logits, hidden_states, attentions)
            return ((loss,) + output) if loss is not None else output

        return ViltForImagesAndTextClassificationOutput(
            loss=loss,
            logits=logits,
            hidden_states=hidden_states,
            attentions=attentions,
        )

I don't know where the exact problem is, but after passing image_token_type_idx=1, I didn't get any error.

alaradirik commented 1 year ago

Hi @shantanu778 could you provide a complete example, including the toy inputs, batch generation and the forward pass so that we can replicate the error?

Are you trying to customize the model, or is the CustomModel class just meant to fix an existing issue?

shantanu778 commented 1 year ago

@alaradirik First of all, CustomModel is mainly meant to fix an existing issue: when I tried to fine-tune ViltForImagesAndTextClassification, I got the error above. So I created the CustomModel class based on your source code and fixed the issue by editing image_token_type_idx in the forward function. But I am not sure whether this is the right way to fix it.

Now let me describe my task: I have a text and 10 images, and I have to find the correct image among the 10 images. I wanted to solve this problem as multi-label classification.

Dataset

| text | images | gold_image |
| --- | --- | --- |
| gangster outlaw | ['image.166.jpg', 'image.173.jpg', 'image.172.jpg', 'image.165.jpg', 'image.174.jpg', 'image.170.jpg', 'image.171.jpg', 'image.167.jpg', 'image.168.jpg', 'image.169.jpg'] | 'image.165.jpg' |

Custom Dataset

import os

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class ImageTextDataset(Dataset):
    def __init__(self, data_dir, train_df, data_type, device, text_augmentation=False):
        self.data_type = data_type
        self.transforms = transforms.Compose([transforms.Resize([512,512]),transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
        self.data_dir = data_dir
        if self.data_type == "train" or self.data_type == "valid":
          self.all_image_names = list(train_df['images'])
          self.context = list(train_df['text'])
          self.gold_images = list(train_df['gold_image'])

        else:
          raise ValueError("Invalid data type. Expected one of: %s" % self.data_type)

    def __len__(self):
        return len(self.context)

    def __getitem__(self, idx):
        # Load the image and text
        context = self.context[idx]
        #loading images
        if self.data_type=='train' or self.data_type == 'valid':
          label = []
          images = self.all_image_names[idx]
          image = []
          for i, im in enumerate(images):
              path = os.path.join(self.data_dir, im)
              img = Image.open(path)
              if img.mode != "RGB":
                  img = img.convert('RGB')
              img = self.transforms(img)
              image.append(img)
              label.append(1.0) if im == self.gold_images[idx] else label.append(0.0)

          sample = {'context':context, 'images': image, 'label': label}

        else:
          raise ValueError("Invalid data type. Expected one of: %s" % self.data_type)
        return sample

Custom Data collator Function

import torch


def custom_collate(batch, processor):
  tokenizer = processor['tokenizer']
  feature_extractor = processor['feature_extractor']
  dic = {}
  context = []
  images = []
  labels = []
  for item in batch:
    context.append(item['context'])
    images.append(item['images'])
    labels.append(item['label'])

  pixel_masks, pixel_values = [], []
  for idx, s in enumerate(images):
    # print(s)
    pixel_mask, pixel_value, label = [], [], []
    for jdx, img in enumerate(s):
      # print(img.size())
      # print(img.size())
      feature_encoding = feature_extractor(img, return_tensors="pt")
      pixel_mask.append(feature_encoding['pixel_mask'].squeeze(0))
      pixel_value.append(feature_encoding['pixel_values'].squeeze(0))
    pixel_mask = torch.stack(pixel_mask)
    pixel_value = torch.stack(pixel_value)

    pixel_masks.append(pixel_mask)
    pixel_values.append(pixel_value)

  encoding = tokenizer(context, return_tensors="pt", padding=True, truncation=True, max_length=40)
  encoding['pixel_values'] = torch.stack(pixel_values)
  encoding['pixel_mask'] = torch.stack(pixel_masks)
  encoding['labels'] = torch.as_tensor(labels)
  return encoding

Training Script

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import BertTokenizerFast, ViltConfig, ViltFeatureExtractor, get_scheduler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "dandelin/vilt-b32-finetuned-coco"
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
feature_extractor = ViltFeatureExtractor.from_pretrained(checkpoint)
processor = {
    'tokenizer': tokenizer,
    'feature_extractor': feature_extractor
}
model = CustomModel(config=ViltConfig.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True, num_images=10, num_labels=10, problem_type="multi_label_classification"))
model.to(device)
print(model.config.architectures[0])
# Create the dataset
train_ds = ImageTextDataset('/train_images_v1', train, data_type="train",device = device, text_augmentation=True)
# Create the dataloader
train_dataloader = DataLoader(train_ds, shuffle=True, batch_size=6, collate_fn=lambda batch: custom_collate(batch, processor))
print(len(train_dataloader))
# model.to(device)
lr = 5e-5
optimizer = AdamW(model.parameters(), lr=lr)
num_epochs = 2
num_training_steps = num_epochs * len(train_dataloader)
progress_bar_train = tqdm(range(num_training_steps))
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
for i in range(num_epochs):
  total_loss = 0
  print(f"Epoch {i+1}")
  model.train()
  for batch in train_dataloader:
    batch.to(device)
    outputs = model(input_ids=batch['input_ids'], pixel_values=batch['pixel_values'], labels=batch['labels'])
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar_train.update(1)

Now if you use ViltForImagesAndTextClassification for fine-tuning, you will encounter the error. If you instead use my CustomModel from the previous comment, the issue goes away.

N.B.: I have never created an issue before, so I don't know the proper way to explain the problem and the task. Sorry for the inconvenience.

alaradirik commented 1 year ago

Hi @shantanu778, could you provide a minimal code example that reproduces the error without the custom class?

shantanu778 commented 1 year ago

I don't know how to give you a minimal code example; I described above what I wanted to do. If you try to fine-tune ViltForImagesAndTextClassification with 10 images instead of 2, I think you will be able to reproduce the error. In my case, instead of using the CustomModel class, use ViltForImagesAndTextClassification; the rest is exactly as I mentioned earlier. @alaradirik

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

miandai commented 1 year ago

A simple solution: set modality_type_vocab_size = num_images + 1.
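
For example (a sketch along those lines, reusing the dandelin/vilt-b32-finetuned-coco checkpoint from the training script above; the enlarged token type embedding is left randomly initialized and gets trained during fine-tuning):

from transformers import ViltConfig, ViltForImagesAndTextClassification

num_images = 10
config = ViltConfig.from_pretrained(
    "dandelin/vilt-b32-finetuned-coco",
    num_images=num_images,
    num_labels=num_images,
    modality_type_vocab_size=num_images + 1,  # one slot for text + one per image index
)
model = ViltForImagesAndTextClassification.from_pretrained(
    "dandelin/vilt-b32-finetuned-coco",
    config=config,
    ignore_mismatched_sizes=True,  # token type embedding grows from 2 to num_images + 1 rows
)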