salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Use pretrained Q-Former with multiple image resolutions #329

Open · david-az opened this issue 1 year ago

david-az commented 1 year ago

The BLIP-2 paper specifies that Q-Former "extracts a fixed number of output features from the image encoder, independent of input image resolution."

However, this doesn't seem possible when using cross-attention, since it relies on encoder_width, which is fixed. I want to use the Q-Former with a Pyramid Vision Transformer (PVT) as the frozen image encoder. PVT handles multiple image resolutions and does not output features at a fixed resolution.

Is there a way to use cross-attention in that case?

LiJunnan1992 commented 1 year ago

Cross-attention can handle an arbitrary sequence length (i.e. number of image patches), which is what the paper refers to. However, the Q-Former does not natively support an arbitrary feature dimension (encoder_width).
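To illustrate the distinction, here is a minimal PyTorch sketch (not LAVIS code): cross-attention accepts any number of image tokens, but the key/value projections are built for a fixed feature width (encoder_width), so features from an encoder with a different channel dimension would need a learned projection first. All dimensions and the `proj` layer below are hypothetical assumptions for illustration, not part of the BLIP-2 checkpoint or the LAVIS API.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
encoder_width = 1408      # feature width the cross-attention expects (e.g. a ViT-g-like encoder)
pvt_channels = 512        # channel dim of a PVT stage; varies by PVT variant
num_query_tokens = 32     # BLIP-2 uses a fixed set of 32 learned query tokens
qformer_hidden = 768      # Q-Former hidden size

# Learned query tokens: their number is fixed, independent of image resolution.
query_tokens = nn.Parameter(torch.zeros(1, num_query_tokens, qformer_hidden))

# Cross-attention: queries attend to image tokens. The key/value projections
# are constructed for inputs of width `encoder_width`, so that dim is fixed.
cross_attn = nn.MultiheadAttention(
    embed_dim=qformer_hidden,
    num_heads=12,
    kdim=encoder_width,
    vdim=encoder_width,
    batch_first=True,
)

# Hypothetical bridge: project PVT features onto the expected encoder_width.
proj = nn.Linear(pvt_channels, encoder_width)

# Two different input resolutions -> different numbers of image tokens,
# but the same fixed number of output query features.
for num_tokens in (196, 784):
    pvt_feats = torch.randn(1, num_tokens, pvt_channels)  # frozen PVT output
    image_embeds = proj(pvt_feats)                         # (1, num_tokens, encoder_width)
    out, _ = cross_attn(query_tokens, image_embeds, image_embeds)
    print(out.shape)  # torch.Size([1, 32, 768]) regardless of num_tokens
```

Under these assumptions, varying the number of image tokens is fine, while changing the feature width requires either a projection like `proj` or rebuilding (and retraining) the cross-attention layers for the new width.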