kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Concatenating two captions in retrieval mode #5

Closed jeasinema closed 1 year ago

jeasinema commented 1 year ago

In the paper, concatenating two captions in retrieval mode only hurts performance, so maybe we should just remove it? Also, I couldn't find all_last_embedding_idx anywhere else in this file.

https://github.com/kohjingyu/fromage/blob/2652cc647339aec32d6ef8be7cbf51e7d9fc341f/fromage/models.py#L320

kohjingyu commented 1 year ago

Thanks for letting me know. This was part of a piece of code that was supposed to be removed in the final version, but I forgot to remove it. It's been removed now. Thanks!

jeasinema commented 1 year ago

TY! Just one quick follow-up: in Fig. 2 of the paper, FROMAGe also seems to have a "captioning" loss for autoregressive text generation in retrieval mode. I'm wondering whether this is useful, since the LM is basically frozen and the learnable [RET] token is at the very end.

kohjingyu commented 1 year ago

The captioning loss (or rather, the next-token prediction loss, since we don't have an image input) is useful because it trains the model to produce [RET]. Without this, the model will never produce [RET] during inference time with greedy/nucleus sampling, because it is a new token (and the pretrained LLM never saw it).
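To make the point above concrete, here is a minimal sketch (not the FROMAGe training code) of a next-token prediction loss over a caption that ends in [RET]. The toy vocabulary, the token ids, and the logit values are all hypothetical; the only thing taken from the thread is that [RET] is the final target token, so the loss contains a term that rewards predicting it.

```python
import math

def next_token_loss(logits_per_step, target_ids):
    """Mean cross-entropy between each step's logits and the next token."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)

# Hypothetical toy vocabulary; [RET] is the newly added token.
VOCAB = ["a", "sparrow", ".", "[RET]"]
RET_ID = VOCAB.index("[RET]")

# Caption "a sparrow . [RET]": the targets are the tokens after "a".
targets = [VOCAB.index("sparrow"), VOCAB.index("."), RET_ID]

# A model with no preference for any token (uniform logits).
untrained_logits = [[0.0, 0.0, 0.0, 0.0] for _ in targets]

# A model that has learned the sequence, including [RET] at the end.
trained_logits = [
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]

untrained_loss = next_token_loss(untrained_logits, targets)
trained_loss = next_token_loss(trained_logits, targets)
```

Because the last target is [RET], lowering this loss necessarily raises the probability assigned to [RET] after the caption text. Which parameters actually receive gradients (the thread mentions a frozen LM and a learnable [RET] token) is outside the scope of this sketch.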

jeasinema commented 1 year ago

Got it. So the point here is to encourage the model to proactively retrieve some images (as in the demo in Fig. 1)? I'm quite interested in how you create such training data. I'm guessing something like:

Show me some pictures of a sparrow. Here are some photos of a sparrow [RET].

kohjingyu commented 1 year ago

Correct. This is described in detail in Sec. 3.2 of the paper, but basically we just append [RET] to the end of every caption (there are probably better ways to do this):

https://github.com/kohjingyu/fromage/blob/38cc18f010610f3e733c5fa1f76e6107803c6718/fromage/data.py#L107
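As a rough illustration of the step described above (not a copy of the linked line in data.py), appending [RET] to every caption might look like the following. The helper name `append_ret` and the example captions are hypothetical, and separating the token from the caption with a single space is an assumption; the actual code may join them differently.

```python
RET_TOKEN = "[RET]"  # token string as written in this thread

def append_ret(caption: str, ret_token: str = RET_TOKEN) -> str:
    """Return the caption with the retrieval token appended at the end."""
    # Assumption: drop trailing whitespace and separate with one space.
    return f"{caption.rstrip()} {ret_token}"

# Hypothetical captions, for illustration only.
captions = ["A sparrow perched on a branch.", "Two dogs playing in the snow. "]
retrieval_captions = [append_ret(c) for c in captions]
```

Every caption processed this way ends in [RET], which is why the model learns to emit the token at the end of its output.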

jeasinema commented 1 year ago

Thanks for letting me know. Yes, this does force the model to emit [RET] at the end every time, but at this point that seems quite reasonable.