Are context images really needed during few-shot inference?

Dear author,

I have recently been reproducing the few-shot image captioning task. I notice that in the Flamingo and OpenFlamingo setup, each token can attend to at most one image: the most recent preceding one (or none). This means that in k-shot image captioning, each newly generated token can attend only to the query image, so the k context images cannot be accessed at all. The generation process depends only on the query image, and each context (image, text) pair contributes only its text tokens and the '<image>' token, without any encoded visual information.
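For concreteness, here is a minimal sketch of the masking rule described above. It is not the actual OpenFlamingo code: `media_index_per_token` and `cross_attention_mask` are hypothetical helpers, tokens are plain strings rather than tokenizer ids, and I assume the `<image>` token itself counts as attending to its own image.

```python
IMAGE_TOKEN = "<image>"  # assumed placeholder string for an image position


def media_index_per_token(tokens):
    """For each token, return the 0-based index of the most recent image
    at or before it, or -1 if no image has appeared yet."""
    indices = []
    seen = 0
    for tok in tokens:
        if tok == IMAGE_TOKEN:
            seen += 1
        indices.append(seen - 1)
    return indices


def cross_attention_mask(tokens):
    """mask[i][j] is True iff token i may cross-attend to image j."""
    idx = media_index_per_token(tokens)
    num_images = tokens.count(IMAGE_TOKEN)
    return [[i == j for j in range(num_images)] for i in idx]


# A 2-shot prompt followed by the query image and the first generated token.
tokens = [
    "<image>", "a", "cat", "<|endofchunk|>",
    "<image>", "a", "dog", "<|endofchunk|>",
    "<image>", "a",
]
mask = cross_attention_mask(tokens)
print(mask[-1])  # row for the last token: only the query image is visible
```

Under these assumptions, the row for any token after the third `<image>` is True only in the last column, which is the behaviour the question is about.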

I ran some experiments and found that using (image, text) pairs as context and using text only as context give very similar CIDEr scores. Does this mean that Flamingo few-shot inference depends only on the k text-only context examples rather than on the paired (image, text) context? Or did I miss some details?
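To make the two settings explicit, the sketch below builds both kinds of prompt. The `build_prompt` helper is hypothetical, the "An image of" prefix and the `<|endofchunk|>` separator follow the style of the OpenFlamingo README example, and dropping the context `<image>` tokens is only one possible way to form a text-only context; the exact prompts used in the experiments above may differ.

```python
def build_prompt(context_captions, with_images=True, query_prefix="An image of"):
    """Build a k-shot captioning prompt (hypothetical helper).

    with_images=True  -> each context caption is preceded by "<image>"
    with_images=False -> context captions appear as plain text only
    The query image placeholder is always present.
    """
    image = "<image>" if with_images else ""
    shots = "".join(f"{image}{cap}<|endofchunk|>" for cap in context_captions)
    return f"{shots}<image>{query_prefix}"


captions = ["An image of two cats.", "An image of a bathroom sink."]
paired_prompt = build_prompt(captions, with_images=True)
text_only_prompt = build_prompt(captions, with_images=False)
print(paired_prompt)
print(text_only_prompt)
```

In the text-only setting, only the query image would be passed to the vision encoder, so the number of images given to the model has to match the single remaining `<image>` token.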

Thank you :)

mlfoundations/open_flamingo, issue #278