Open poipiii opened 2 years ago

Hi, I would like to ask how I should approach fine-tuning BLIP for image retrieval. My dataset contains caption–image pairs with no bounding box annotations. Is it possible to train BLIP without annotations, or should I create a bounding box with width/height equal to the image width/height for each image?

Hi, BLIP does not require bounding box input. You can try to use the entire image as input.

Can you describe how that would work, and how I should define the dataset for BLIP image retrieval fine-tuning?

You can define the dataset following the same format as COCO.

Oh, I get it, so I define my dataset in a JSON file following the coco_karpathy dataset format, like this:

```json
{
  "caption": "example caption for image",
  "image": "001.png",
  "image_id": "001"
}
```
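For anyone following along, here is a minimal sketch of generating such an annotation file from caption–image pairs. The pair data, the `image_id` convention (derived from the file name), and the output path `my_dataset_train.json` are all placeholders; adapt them to your own dataset.

```python
import json

# Placeholder caption-image pairs; replace with your own data.
pairs = [
    ("001.png", "example caption for image one"),
    ("002.png", "example caption for image two"),
]

# Build a coco_karpathy-style annotation list: one record per
# caption-image pair, no bounding boxes required.
annotations = [
    {
        "caption": caption,
        "image": image,
        # Derive an id from the file name (a placeholder convention).
        "image_id": image.rsplit(".", 1)[0],
    }
    for image, caption in pairs
]

with open("my_dataset_train.json", "w") as f:
    json.dump(annotations, f, indent=2)
```

The top-level structure is a JSON list of these records, which is the shape the COCO Karpathy annotation files use, so the existing dataset loader can pick it up with only the paths changed.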