salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

Issue related to BLIP2 `CaptionDataset` implementation or `blip2_qformer.py` for custom dataset pre-training stage 1 #772

Open abdel-habib opened 6 hours ago

abdel-habib commented 6 hours ago

While pre-training on a custom image-text dataset, I ran into some concerns about how the CaptionDataset class and blip2_qformer.py handle captioning datasets.

In blip2_qformer.py, the if statement at line 159 has a comment saying that image_id is used only for retrieval fine-tuning, and it checks whether "image_id" is among the sample keys. The same applies at line 180.

if "image_id" in samples.keys():  # coco retrieval finetuning
    image_ids = samples["image_id"].view(-1, 1)
    ...
    loss_itc = ...
else:
    loss_itc = ...

These two if statements trigger errors with a custom image-text captioning dataset. I don't know why they don't trigger an error with coco_caption_dataset.py, since COCOCapDataset uses the CaptionDataset implementation and returns image_id when getting an item.

By commenting out the true branches of the if statements at lines 159 and 180, stage 1 pre-training on custom datasets runs perfectly. Is this expected behaviour, or am I missing something?

Also, samples["image_id"] seems to be a list of strings, even with the COCO file naming pattern. When getting an item from the custom dataset implementation, it returns a string as the id, so anything inside the true branches mentioned above will cause an error (i.e. samples["image_id"].view(-1,1) fails because samples["image_id"] is a list of strings, not a tensor of ints).
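The type mismatch can be reproduced in plain Python, without torch or LAVIS. This is only a minimal sketch, assuming the batch's image_id arrives as a plain Python list of strings as described above. The sample ids and the has_view helper are hypothetical and not part of LAVIS.

```python
# Hypothetical stand-in for a collated batch from a custom dataset:
# string ids stay a plain Python list rather than becoming a tensor.
samples = {"image_id": ["000000391895", "000000522418"]}


def has_view(obj):
    """Return True if obj exposes a tensor-style .view method."""
    return callable(getattr(obj, "view", None))


# A list has no .view, so the retrieval branch cannot reshape it.
ids_ok = has_view(samples["image_id"])

try:
    samples["image_id"].view(-1, 1)
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__
```

A check like has_view could guard the branch instead of testing only for the key, though whether that fits the intended design of blip2_qformer.py is an open question.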

abdel-habib commented 6 hours ago

Following up on the second issue: a minor modification worked for me when creating a custom captioning dataset, namely returning a unique numerical id built the same way as the original self.img_ids = {} loop.

        return {
            "image": image,
            "text_input": caption,
            "image_id": self.img_ids[ann["image_id"]] # this is the main difference, return the unique numerical id
        }
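For context, the mapping from string ids to numerical ids can be sketched as follows. This is a minimal illustration, assuming each annotation is a dict with an "image_id" key. The build_img_ids function and the sample annotations are hypothetical, written to resemble the self.img_ids = {} loop mentioned above rather than copied from LAVIS.

```python
def build_img_ids(annotations):
    """Assign consecutive integers to image ids in first-seen order."""
    img_ids = {}
    for ann in annotations:
        img_id = ann["image_id"]
        if img_id not in img_ids:
            # The next free integer equals the number of ids seen so far.
            img_ids[img_id] = len(img_ids)
    return img_ids


# Hypothetical annotations; two captions share the same image.
annotations = [
    {"image_id": "img_a", "caption": "a dog"},
    {"image_id": "img_b", "caption": "a cat"},
    {"image_id": "img_a", "caption": "a dog running"},
]

img_ids = build_img_ids(annotations)

# What __getitem__ would return as "image_id" for each annotation.
numeric_ids = [img_ids[ann["image_id"]] for ann in annotations]
```

Because the returned ids are integers, a default batch collation step can turn them into a tensor, which is what the .view(-1, 1) call expects.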