salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

BLIP-2 input image size setting (image captioning) #532

Open Hangsiin opened 10 months ago

Hangsiin commented 10 months ago

In the BLIP-2 paper: "We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM. It extracts a fixed number of output features from the image encoder, independent of input image resolution."

I used the BLIP-2 model to train an image captioning model on my dataset, and there was actually no difference in training speed whether I used an input image size of 64 or 364 with the processor provided by Hugging Face.

So, when training an image captioning model, is it best to use BLIP-2's original input image size and not specify anything else?
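For reference, a minimal sketch (not from the original post) of where the image size actually comes from when using the Hugging Face BLIP-2 processor; the checkpoint name, the stand-in image, and the printed shapes are illustrative assumptions:

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# The processor resizes every image to the size stored in its own config,
# so the raw resolution of the dataset images does not change the tensor
# the model actually receives.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
print(processor.image_processor.size)   # e.g. {"height": 224, "width": 224}

image = Image.new("RGB", (640, 480))    # stand-in for a dataset image
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)     # torch.Size([1, 3, 224, 224])

# The Q-Former uses a fixed set of learned query tokens (32 for BLIP-2),
# which is what the paper means by "a fixed number of output features".
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
print(model.query_tokens.shape)         # torch.Size([1, 32, 768])
```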

shams2023 commented 9 months ago

> In the BLIP-2 paper: "We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM. It extracts a fixed number of output features from the image encoder, independent of input image resolution."
>
> I used the BLIP-2 model to train an image captioning model on my dataset, and there was actually no difference in training speed whether I used an input image size of 64 or 364 with the processor provided by Hugging Face.
>
> So, when training an image captioning model, is it best to use BLIP-2's original input image size and not specify anything else?

Hi, brother! May I ask if you are also working on image captioning? I also want to use BLIP-2 to generate captions for my dataset. Have you implemented it?
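(For anyone landing here with the same question, a minimal caption-generation sketch with the Hugging Face BLIP-2 API is shown below; the checkpoint name and image path are illustrative assumptions, not something confirmed in this thread.)

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

# "example.jpg" is a placeholder for an image from your own dataset.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

# Unconditional captioning: the language model decodes directly from the
# Q-Former's query outputs, with no text prompt.
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```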