kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Should last_embedding_idx be caption_len - 2? #4

Closed sijeh closed 1 year ago

sijeh commented 1 year ago

https://github.com/kohjingyu/fromage/blob/2652cc647339aec32d6ef8be7cbf51e7d9fc341f/fromage/models.py#L184

Hello kohjingyu, thanks for your great work!

I'm a bit confused about the variable last_embedding_idx in models.py. The input caption appears to be [..., [RET], [EOS]], so caption_len - 1 would be the index of [EOS], and last_hidden_state[i, last_embedding_idx[i], :] would be the output hidden state of the [EOS] token, which is then used to retrieve images. Is something wrong here? Please point it out if I have misunderstood the intent.

Best regards.

kohjingyu commented 1 year ago

Thanks for your interest! This is because the OPT models and tokenizer do not append an [EOS] token by default. The last token during retrieval training is therefore [RET], which is why the index is caption_len - 1. The comment there is a bit misleading, which I apologize for!
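To make the indexing concrete, here is a minimal toy sketch (not the code from models.py). It assumes captions are plain lists of token strings with right-side padding; the helper names `pad_batch` and `last_token_index` and the `[PAD]` placeholder are hypothetical and only serve the illustration. It shows that caption_len - 1 lands on [RET] when no [EOS] is appended, and would land on [EOS] only if one were appended.

```python
# Toy illustration only: token strings stand in for token ids, and no real
# tokenizer or model is involved. All helper names here are hypothetical.
PAD = "[PAD]"


def pad_batch(captions, max_len):
    """Right-pad every caption to max_len with the PAD placeholder."""
    return [c + [PAD] * (max_len - len(c)) for c in captions]


def last_token_index(caption_len):
    """Index of the final non-padding token of a caption."""
    return caption_len - 1


# Captions as they look when the tokenizer does NOT append [EOS].
captions = [
    ["a", "dog", "[RET]"],
    ["a", "cat", "on", "a", "mat", "[RET]"],
]
caption_lens = [len(c) for c in captions]
batch = pad_batch(captions, max(caption_lens))

last_embedding_idx = [last_token_index(n) for n in caption_lens]
selected = [batch[i][last_embedding_idx[i]] for i in range(len(batch))]
print(selected)  # ['[RET]', '[RET]']

# Hypothetical case from the question: if [EOS] were appended,
# caption_len - 1 would point at [EOS] and caption_len - 2 at [RET].
captions_eos = [c + ["[EOS]"] for c in captions]
lens_eos = [len(c) for c in captions_eos]
selected_eos = [c[n - 1] for c, n in zip(captions_eos, lens_eos)]
selected_ret = [c[n - 2] for c, n in zip(captions_eos, lens_eos)]
print(selected_eos)  # ['[EOS]', '[EOS]']
print(selected_ret)  # ['[RET]', '[RET]']
```

Whether caption_len - 1 or caption_len - 2 is correct therefore depends entirely on whether the tokenizer appends a trailing token, which is worth checking on a sample caption before reusing the indexing with a different tokenizer.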

sijeh commented 1 year ago

I see, thanks for your kind reply.