In the BLIP-2 paper, it is stated that "[Q-Former] extracts a fixed number of output features from the image encoder, independent of input image resolution."
However, when using cross-attention this doesn't seem possible, since it uses encoder_width, which is fixed.
I want to use Q-Former with a Pyramid Vision Transformer as the frozen image encoder. PVT handles multiple image resolutions and does not output a fixed feature resolution.
Is there a way to use cross-attention in that case?
Cross-attention can deal with an arbitrary sequence length (i.e. number of image patches), which is what the paper refers to: the number of output features is fixed by the learned queries, no matter how many patch features the frozen encoder produces. What the Q-Former does not natively support is an arbitrary feature dimension: its cross-attention layers are built for a fixed encoder_width, so the frozen encoder's features must already have (or be projected to) that channel size.
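A minimal PyTorch sketch of that distinction, using torch.nn.MultiheadAttention as a stand-in for the Q-Former's cross-attention and hypothetical PVT stage dimensions: the token count coming from the image encoder can vary with resolution, but each stage is projected to a common encoder_width before cross-attention, and the output always has a fixed number of query features.

```python
import torch
import torch.nn as nn

num_queries, query_dim = 32, 768   # fixed number of learned queries (Q-Former side)
encoder_width = 768                # channel dim the cross-attention expects (fixed)

queries = nn.Parameter(torch.randn(1, num_queries, query_dim))
cross_attn = nn.MultiheadAttention(embed_dim=query_dim, num_heads=12,
                                   kdim=encoder_width, vdim=encoder_width,
                                   batch_first=True)

# Hypothetical PVT stage channel dims; project every stage to encoder_width so
# the widths agree. The sequence length (number of tokens) stays unconstrained.
pvt_dims = [64, 128, 320, 512]
projections = nn.ModuleList(nn.Linear(d, encoder_width) for d in pvt_dims)

def fuse_pvt_features(stage_maps):
    """stage_maps: list of (B, C_i, H_i, W_i) feature maps from a frozen PVT."""
    tokens = []
    for proj, fmap in zip(projections, stage_maps):
        seq = fmap.flatten(2).transpose(1, 2)   # (B, H_i*W_i, C_i) - any length
        tokens.append(proj(seq))                # (B, H_i*W_i, encoder_width)
    return torch.cat(tokens, dim=1)             # (B, N_total, encoder_width)

# Fake PVT outputs for two different input resolutions: the token count changes,
# but the cross-attention output stays (B, num_queries, query_dim).
for size in [(56, 56), (80, 80)]:
    stage_maps = [torch.randn(2, d, size[0] // (2 ** i), size[1] // (2 ** i))
                  for i, d in enumerate(pvt_dims)]
    img_tokens = fuse_pvt_features(stage_maps)
    out, _ = cross_attn(queries.expand(2, -1, -1), img_tokens, img_tokens)
    print(img_tokens.shape, out.shape)          # variable N_total, fixed 32 queries
```

So for PVT the practical requirement is a projection (or a PVT variant whose final stage already matches encoder_width), not a fixed spatial resolution.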