Open abdel-habib opened 6 hours ago
Following up on the second issue: a minor modification worked for me when creating a custom captioning dataset. It returns a unique numerical id, built the same way as the original implementation's self.img_ids = {} loop.
return {
    "image": image,
    "text_input": caption,
    "image_id": self.img_ids[ann["image_id"]],  # this is the main difference: return the unique numerical id
}
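For readers who have not seen that loop, here is a minimal sketch of the idea as I understand it: each distinct annotation id is assigned a consecutive integer, which the dataset can then return instead of the raw string. The helper name build_img_ids and the sample annotations are hypothetical and only for illustration; they are not taken from the library.

```python
def build_img_ids(annotations):
    """Map each distinct ann["image_id"] to a consecutive integer (0, 1, 2, ...)."""
    img_ids = {}
    for ann in annotations:
        raw_id = ann["image_id"]
        if raw_id not in img_ids:
            # First time this id is seen: give it the next free index.
            img_ids[raw_id] = len(img_ids)
    return img_ids


# Hypothetical annotations with string ids, as a custom dataset might have.
annotations = [
    {"image_id": "scan_0007", "caption": "first caption"},
    {"image_id": "scan_0007", "caption": "second caption for the same image"},
    {"image_id": "scan_0012", "caption": "another image"},
]

img_ids = build_img_ids(annotations)

# What the dataset would return as "image_id" for each annotation.
numeric_ids = [img_ids[ann["image_id"]] for ann in annotations]
```

Captions that belong to the same image share one numeric id, which is what an id-based comparison inside the model would need.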
While pre-training on a custom image-text dataset, I ran into some concerns about how both the CaptionDataset class and the blip2_qformer.py file handle captioning datasets.
If you look at the blip2_qformer.py implementation, line 159, the if statement has a comment saying that image_id is used only for retrieval tasks, and it checks whether "image_id" is in the sample keys. The same applies to line 180. These two if statements trigger errors with a custom image-text captioning dataset. I don't know how they avoid triggering an error with coco_caption_dataset.py, since COCOCapDataset uses the CaptionDataset class implementation, which returns the image_id when getting an item.
After commenting out the (True) blocks of the if statements at lines 159 and 180, stage 1 pre-training with custom datasets runs perfectly. Is this expected behaviour, or am I missing something?
Also, samples["image_id"] seems to be a list of strings, even with the COCO file naming pattern. When getting an item with the custom dataset implementation, it returns a string as the id, so anything inside the if (True) blocks mentioned previously will cause an error (i.e. samples["image_id"].view(-1,1) is called on a list of strings, not a tensor of ints).
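To make the failure mode concrete, here is a small sketch that runs without PyTorch. It relies on one fact about PyTorch's default collate function: integer fields are stacked into a tensor, while string fields are left as a plain Python list, which has no .view method. The helper to_numeric_ids and the id values are hypothetical; the final conversion (for example with torch.tensor) is left out on purpose.

```python
def to_numeric_ids(batch_ids, img_ids):
    """Replace string ids in a collated batch with their integer index."""
    return [img_ids[i] if isinstance(i, str) else int(i) for i in batch_ids]


# What a collated batch looks like when the dataset returns string ids.
batch_ids = ["scan_0007", "scan_0012"]

# A plain list has no .view, so samples["image_id"].view(-1, 1) cannot work.
has_view = hasattr(batch_ids, "view")

# Hypothetical mapping from raw string ids to consecutive integers.
mapping = {"scan_0007": 0, "scan_0012": 1}
numeric = to_numeric_ids(batch_ids, mapping)
```

Returning the integer from the dataset itself is the simpler route, since the batch then arrives as a tensor and no per-batch conversion is needed.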