salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Batch predictions Image Captioning task #58

Open MikeMACintosh opened 2 years ago

MikeMACintosh commented 2 years ago

Hi, glad to see and use this cool project, thank you. I have a question: is it possible to do batch predictions for the image captioning task? I've seen https://github.com/salesforce/BLIP/issues/48, but it doesn't cover my case.

I do something like:

```python
base_model_path = 'path_to_base_model'
model_base = blip_decoder(pretrained=base_model_path, vit='base', image_size=IMAGE_SIZE)
model_base.eval()
model_base.to(device)
```

```python
img = transform(sample).unsqueeze(0).to(device)
with torch.no_grad():
    caption_bs_base = model_base.generate(img, sample=False, num_beams=7, max_length=16, min_length=5)
```

It works well, but I want to run inference with 4 models (ViT base/large, each with beam search and nucleus sampling) and it takes too long. On my server, captioning 12 pictures with all 4 models takes ~34 s (12 × 4 = 48 captions).

Thank you.

LiJunnan1992 commented 2 years ago

Yes, you can do batch inference.
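
For reference, a minimal sketch of what batched generation can look like, reusing `transform`, `device`, and `model_base` from the snippet above; `samples` here is a hypothetical list of PIL images, not a name from this thread:

```python
import torch

# samples: hypothetical list of PIL images; transform, device, model_base
# come from the setup shown earlier in this thread.
imgs = torch.stack([transform(s) for s in samples]).to(device)  # (B, 3, H, W)

with torch.no_grad():
    # generate() accepts a batched image tensor and returns a list with
    # one caption string per image in the batch.
    captions = model_base.generate(imgs, sample=False, num_beams=7,
                                   max_length=16, min_length=5)
```

Note that the only change from single-image inference is stacking the transformed images along the batch dimension instead of calling `unsqueeze(0)` on one image.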

MikeMACintosh commented 2 years ago

@LiJunnan1992 Could you explain how I can do that? Should I write my own DataLoader?

poipiii commented 2 years ago

Yes, you have to write your own data loader. I just did it myself.
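
A rough sketch of such a data loader, under the assumptions above (`image_paths` is a hypothetical list of file paths; `transform`, `device`, and `model_base` are from the earlier snippets), not the exact code from this thread:

```python
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class CaptionDataset(Dataset):
    """Loads images from disk and applies the BLIP eval transform."""
    def __init__(self, image_paths, transform):
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert('RGB')
        return self.transform(img)

# image_paths is an assumption: a list of paths to the pictures to caption.
loader = DataLoader(CaptionDataset(image_paths, transform),
                    batch_size=12, num_workers=4)

all_captions = []
with torch.no_grad():
    for batch in loader:                      # batch: (B, 3, H, W)
        all_captions += model_base.generate(batch.to(device), sample=False,
                                            num_beams=7, max_length=16,
                                            min_length=5)
```

The default collate function stacks the transformed tensors into a single batch, so each `generate()` call captions a whole batch at once instead of one image at a time.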