salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Fine tune BLIP image retrieval for custom dataset without annotations #55

Open · opened 2 years ago by poipiii

poipiii commented 2 years ago

Hi, I would like to ask how I should approach fine-tuning BLIP for image retrieval. My dataset contains caption-image pairs with no bounding-box annotations. Is it possible to train BLIP without annotations, or should I create a bounding box of width/height equal to the image width/height for each image?

LiJunnan1992 commented 2 years ago

Hi, BLIP does not require bounding box input. You can try to use the entire image as input.

poipiii commented 2 years ago

Can you describe how that would work, and how I should define the dataset for BLIP image-retrieval fine-tuning?

LiJunnan1992 commented 2 years ago

You can define the dataset following the same format as COCO.

poipiii commented 2 years ago

Oh, I get it, so I should define my dataset in a JSON file with the format used by the coco_karpathy dataset, like this:

[
  {
    "caption": "example caption for image",
    "image": "001.png",
    "image_id": "001"
  }
]
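
Something like this minimal PyTorch Dataset sketch should then be able to consume the file; the class name and constructor arguments are illustrative, modeled loosely on the repo's coco_karpathy_train class rather than copied from it:

```python
import json
import os

from PIL import Image
from torch.utils.data import Dataset


class CaptionImageDataset(Dataset):
    # Hypothetical class, sketching the coco_karpathy-style format above.
    # `ann_file` is assumed to be a JSON list of records like
    # {"caption": ..., "image": ..., "image_id": ...}, and `image_root`
    # the directory containing the image files.
    def __init__(self, ann_file, image_root, transform):
        with open(ann_file, "r") as f:
            self.annotation = json.load(f)
        self.image_root = image_root
        self.transform = transform

    def __len__(self):
        return len(self.annotation)

    def __getitem__(self, index):
        ann = self.annotation[index]
        # Load the whole image -- no bounding boxes are needed.
        image = Image.open(os.path.join(self.image_root, ann["image"])).convert("RGB")
        image = self.transform(image)
        return image, ann["caption"], ann["image_id"]
```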