Issue related to BLIP2 `CaptionDataset` implementation or `blip2_qformer.py` for custom dataset pre-training stage 1

While pre-training on a custom image-text dataset, I ran into problems with how both the `CaptionDataset` class and `blip2_qformer.py` handle captioning datasets.

If you look at `blip2_qformer.py`, line 159, the if statement carries a comment saying that `image_id` is used only for retrieval tasks; it checks whether "image_id" is in the sample keys. The same applies at line 180.

if "image_id" in samples.keys():  # coco retrieval finetuning
    image_ids = samples["image_id"].view(-1, 1)
    ...
    loss_itc = ...
else:
    loss_itc = ...

These two if statements trigger errors with a custom image-text captioning dataset. I don't understand how they avoid triggering an error with `coco_caption_dataset.py`, since `COCOCapDataset` uses the `CaptionDataset` implementation, which returns the `image_id` when getting an item.

If I comment out the (True) branches of the if statements at lines 159 and 180, stage 1 pre-training on custom datasets runs perfectly. Is this expected behaviour, or am I missing something?

Also, `samples["image_id"]` seems to be a list of strings, even with the COCO file naming pattern. When getting an item, the custom dataset implementation returns a string as the id, so anything inside the if (True) blocks mentioned above raises an error (i.e. `samples["image_id"].view(-1,1)` fails because `samples["image_id"]` is a list of strings, not a tensor of ints).
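
For what it's worth, here is a minimal sketch of one possible workaround instead of commenting out the branches: map each string id to a contiguous integer before it leaves the dataset. PyTorch's default collate function stacks Python ints into a tensor but leaves strings as a plain list, which would explain why `.view(-1, 1)` fails on string ids. The helper name `build_image_id_index` and the sample annotations are hypothetical, for illustration only, and I have not confirmed that this is how LAVIS intends custom datasets to be set up.

```python
from typing import Dict, Hashable, Iterable, List


def build_image_id_index(image_ids: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct image id to a contiguous integer, in first-seen order."""
    index: Dict[Hashable, int] = {}
    for image_id in image_ids:
        if image_id not in index:
            index[image_id] = len(index)
    return index


# Hypothetical annotations from a custom dataset; the ids are strings.
annotations = [
    {"image_id": "img_0001", "caption": "a dog on a beach"},
    {"image_id": "img_0002", "caption": "a red car"},
    {"image_id": "img_0001", "caption": "a dog running on sand"},
]

id_index = build_image_id_index(ann["image_id"] for ann in annotations)

# The integer a dataset's __getitem__ could return as "image_id" per annotation.
int_ids: List[int] = [id_index[ann["image_id"]] for ann in annotations]
print(int_ids)  # [0, 1, 0]
```

Captions of the same image share one integer id, so a batch of such ids would collate into an integer tensor that supports `.view(-1, 1)`.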
