Closed ahnjaewoo closed 1 year ago
As the models.py file shows, there are two lines related to this setting:
embedding_dim = self.input_embeddings.embedding_dim * self.args.n_visual_tokens
self.visual_embeddings = nn.Linear(hidden_size, embedding_dim)
Changing n_visual_tokens from 1 to 4 expands the size of the .visual_embeddings layer for the captioning task.
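To make the two lines above concrete, here is a minimal sketch of how the projection scales with n_visual_tokens. The dimensions are hypothetical placeholders (in the real model they come from the visual encoder and the OPT embedding table), and the reshape into per-token vectors is my reading of the design, not code copied from the repo.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
hidden_size = 1024       # assumed visual feature dim from the image encoder
lm_embedding_dim = 2048  # assumed LM token embedding dim (input_embeddings.embedding_dim)
n_visual_tokens = 4      # 1 in the base model, 4 in fromage_vis4

# One linear layer maps each image feature to n_visual_tokens
# LM-embedding-sized vectors, concatenated along the last dimension.
embedding_dim = lm_embedding_dim * n_visual_tokens
visual_embeddings = nn.Linear(hidden_size, embedding_dim)

visual_feats = torch.randn(2, hidden_size)  # batch of 2 image features
out = visual_embeddings(visual_feats)       # shape: (2, n_visual_tokens * lm_embedding_dim)

# Split the flat output into n_visual_tokens separate "visual token" embeddings.
visual_tokens = out.view(2, n_visual_tokens, lm_embedding_dim)
print(visual_tokens.shape)  # torch.Size([2, 4, 2048])
```

With n_visual_tokens = 1 the same layer produces a single embedding per image; with 4, the image is represented by four LM-space vectors, which is why the vis4 variant can capture more complex images.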
What about the visual dialogue ability? Which part of the architecture ensures it? Does it come from the few-shot ability of the facebook/opt model? If I simply replaced it with a plain GPT model (a simple gpt2 model), without any prompt-guided few-shot ability, do you think it would still work in the visual dialogue setting?
Thanks for the great research and effective multimodal model, "Fromage"!!!
I have a question about the number of visual tokens. While reading the paper, I thought that the fromage model 'always' uses a single visual token (i.e., [ret]). Then I read this on the README page:
We have also included a second model trained with a stronger visual linear layer (4 visual tokens instead of 1), located at fromage_model/fromage_vis4. This model generally does better on dialogue settings and does not require as much tuning of inference time hyperparameters, as it is able to better represent more complex images.
What's the difference between the two models? Did you report results from fromage_vis4 model in any experiments? (e.g., visual dialogue)
Thanks!
I don't have full quantitative results for the vis4 model yet, so the findings described in the README are mostly qualitative. I haven't measured how this affects the quantitative results, but I expect it should improve results on both VIST and VisDial. We hope to have some numbers soon for the next version of the paper!