In the BLIP-2 paper, it is stated that "[Q-Former] extracts a fixed number of output features from the image encoder, independent of input image resolution."
However, when using cross-attention this doesn't seem possible, since it uses encoder_width, which is fixed.
I want to use Q-Former with a Pyramid Vision Transformer as the frozen image encoder. PVT handles multiple image resolutions and does not output a fixed feature resolution.
Is there a way to use cross-attention in that case?
Cross-attention can deal with an arbitrary sequence length (i.e. number of image patches), which is what the paper refers to: the number of output features is fixed by the learned queries, no matter how many patch features the frozen encoder produces. What the Q-Former does not natively support is an arbitrary feature dimension: its cross-attention layers are built for a fixed encoder_width, so the frozen encoder's features must already have (or be projected to) that channel size.
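A minimal PyTorch sketch of that distinction, using torch.nn.MultiheadAttention as a stand-in for the Q-Former's cross-attention and hypothetical PVT stage dimensions: the token count coming from the image encoder can vary with resolution, but each stage is projected to a common encoder_width before cross-attention, and the output always has a fixed number of query features.

```python
import torch
import torch.nn as nn

num_queries, query_dim = 32, 768   # fixed number of learned queries (Q-Former side)
encoder_width = 768                # channel dim the cross-attention expects (fixed)

queries = nn.Parameter(torch.randn(1, num_queries, query_dim))
cross_attn = nn.MultiheadAttention(embed_dim=query_dim, num_heads=12,
                                   kdim=encoder_width, vdim=encoder_width,
                                   batch_first=True)

# Hypothetical PVT stage channel dims; project every stage to encoder_width so
# the widths agree. The sequence length (number of tokens) stays unconstrained.
pvt_dims = [64, 128, 320, 512]
projections = nn.ModuleList(nn.Linear(d, encoder_width) for d in pvt_dims)

def fuse_pvt_features(stage_maps):
    """stage_maps: list of (B, C_i, H_i, W_i) feature maps from a frozen PVT."""
    tokens = []
    for proj, fmap in zip(projections, stage_maps):
        seq = fmap.flatten(2).transpose(1, 2)   # (B, H_i*W_i, C_i) - any length
        tokens.append(proj(seq))                # (B, H_i*W_i, encoder_width)
    return torch.cat(tokens, dim=1)             # (B, N_total, encoder_width)

# Fake PVT outputs for two different input resolutions: the token count changes,
# but the cross-attention output stays (B, num_queries, query_dim).
for size in [(56, 56), (80, 80)]:
    stage_maps = [torch.randn(2, d, size[0] // (2 ** i), size[1] // (2 ** i))
                  for i, d in enumerate(pvt_dims)]
    img_tokens = fuse_pvt_features(stage_maps)
    out, _ = cross_attn(queries.expand(2, -1, -1), img_tokens, img_tokens)
    print(img_tokens.shape, out.shape)          # variable N_total, fixed 32 queries
```

So for PVT the practical requirement is a projection (or a PVT variant whose final stage already matches encoder_width), not a fixed spatial resolution.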