kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

What is the "fromage_vis4" model? #8

Closed. ahnjaewoo closed this issue 1 year ago.

ahnjaewoo commented 1 year ago

Thanks for the great research and effective multimodal model, "Fromage"!!!

I have a question: while reading the paper, I thought that the fromage model always uses a single visual token (i.e., [ret]). Then I read this on the README page:

We have also included a second model trained with a stronger visual linear layer (4 visual tokens instead of 1), located at fromage_model/fromage_vis4. This model generally does better on dialogue settings and does not require as much tuning of inference time hyperparameters, as it is able to better represent more complex images.

What's the difference between the two models? Did you report results from the fromage_vis4 model in any experiments (e.g., visual dialogue)?

Thanks!

svjack commented 1 year ago

As models.py shows, there are two lines related to this setting:

embedding_dim = self.input_embeddings.embedding_dim * self.args.n_visual_tokens

self.visual_embeddings = nn.Linear(hidden_size, embedding_dim)

Increasing n_visual_tokens from 1 to 4 expands the output size of .visual_embeddings, which is used for the captioning task.
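
For context, here is a minimal PyTorch sketch of what those two lines imply (class and variable names are hypothetical except n_visual_tokens and visual_embeddings, which mirror models.py): the single linear layer's output is reshaped into n_visual_tokens separate embeddings in the LM's input space, so vis4 hands the LM four vectors per image instead of one.

```python
import torch
import torch.nn as nn


class VisualProjection(nn.Module):
    """Toy version of the projection discussed above: one linear layer whose
    output dimension scales with n_visual_tokens, reshaped into that many
    token embeddings for the LM."""

    def __init__(self, hidden_size: int, lm_embedding_dim: int, n_visual_tokens: int = 4):
        super().__init__()
        self.n_visual_tokens = n_visual_tokens
        # Mirrors: embedding_dim = lm_embedding_dim * n_visual_tokens
        self.visual_embeddings = nn.Linear(hidden_size, lm_embedding_dim * n_visual_tokens)

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, hidden_size) pooled image features
        out = self.visual_embeddings(visual_features)  # (batch, lm_embedding_dim * n_visual_tokens)
        # Split the flat projection into n_visual_tokens separate embeddings.
        return out.view(out.shape[0], self.n_visual_tokens, -1)


# Example: project 1024-d pooled visual features into 4 tokens of a 4096-d LM space.
proj = VisualProjection(hidden_size=1024, lm_embedding_dim=4096, n_visual_tokens=4)
tokens = proj(torch.randn(2, 1024))
print(tokens.shape)  # torch.Size([2, 4, 4096])
```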

svjack commented 1 year ago

Regarding the visual dialogue ability: which part of the structure ensures this ability? Does it come from the few-shot ability of the facebook/opt model? If I simply replaced it with a plain GPT model (a simple gpt2 model) without any prompt-guided few-shot ability, do you think it could still work in a visual dialogue setting?
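
To frame the question, a hypothetical sketch of what swapping the frozen backbone would look like with Hugging Face transformers (the model names and freezing loop are illustrative; whether a plain gpt2 retains enough in-context ability for visual dialogue is exactly the open question):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: load a small GPT-2 backbone in place of OPT.
lm_name = "gpt2"  # e.g. instead of "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)

# FROMAGe keeps the LM frozen and trains only the linear mappings (and the
# [RET] embedding), so the same freezing step would apply here.
for param in lm.parameters():
    param.requires_grad = False
```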

kohjingyu commented 1 year ago

I don't have full quantitative results for the vis4 model yet, so the findings described in the README are mostly qualitative. I'm not sure how much it affects the quantitative results, but I expect it should improve results on both VIST and VisDial. We hope to have some numbers soon for the next version of the paper!