mlfoundations / open_flamingo

An open-source framework for training large multimodal models.
MIT License

Are context images really needed during few-shot inference? #278

Open zhang9302002 opened 1 year ago

zhang9302002 commented 1 year ago

Dear author,

I have recently been reproducing the few-shot image captioning task. I noticed that in the Flamingo and OpenFlamingo setting, each token can attend to only one previous image (or none). This means that when performing k-shot image captioning, a newly generated token can attend only to the query image, so the previous k context images cannot be accessed at all. The generation process depends only on the query image, and the context (image, text) pairs serve only as text tokens and '<image>' tokens, without any encoded visual information.
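
The masking described above can be sketched as follows. This is a minimal illustration in plain Python, not the OpenFlamingo code: the function name `cross_attention_mask` and the example token list are hypothetical, and `<image>` is assumed to be the image placeholder token. It covers only the text-to-image cross-attention mask and says nothing about text-to-text self-attention.

```python
IMAGE_TOKEN = "<image>"  # assumed name of the image placeholder token


def cross_attention_mask(tokens):
    """Build mask[t][i]: True if token position t may attend to image i.

    Each position attends only to the most recent image at or before it,
    or to no image at all if none has appeared yet.
    """
    n_images = sum(tok == IMAGE_TOKEN for tok in tokens)
    mask = []
    seen = 0  # number of image tokens encountered so far
    for tok in tokens:
        if tok == IMAGE_TOKEN:
            seen += 1
        # Only the image with index seen - 1 is visible (none when seen == 0).
        mask.append([i == seen - 1 for i in range(n_images)])
    return mask


# Hypothetical 2-shot prompt: two context pairs, then the query image.
tokens = ["<image>", "A", "cat", "<image>", "A", "dog", "<image>", "A"]
mask = cross_attention_mask(tokens)

for tok, row in zip(tokens, mask):
    print(f"{tok:8s} {row}")
```

In this sketch, the last row (the position after the query image) is True only in the final column, which is the behaviour the question is about: the two context images are masked out for every token generated for the query.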

I ran some experiments and found that using (image, text) pairs as context and using (text) alone as context yield very similar CIDEr scores. Does this mean that Flamingo few-shot inference depends only on the k text-only context examples rather than on paired (image, text) context? Or have I missed some details?

Thank you :)