mlpc-ucsd / BLIVA

(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
https://arxiv.org/abs/2308.09936
BSD 3-Clause "New" or "Revised" License

performance on the VizWiz dataset #20

Open qwqwq1445 opened 9 months ago

qwqwq1445 commented 9 months ago

I loaded your pretrained model weights and used your default parameters to evaluate on the VizWiz dataset. However, the score I get without any prompt template is around 28.00, which is far below the result reported in your paper. Could you tell me what's wrong here? Maybe the default parameters?
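For reference, a simplified sketch of the metric behind numbers like 28.00 on VizWiz is below; the official evaluator additionally normalizes answers and averages over leave-one-out subsets of the ten human annotations, so this is only illustrative.

```python
# Simplified VizWiz/VQA accuracy: a prediction gets full credit if at least
# three of the ten human annotators gave the same answer. Illustrative only;
# the official scorer also normalizes answers and averages over
# leave-one-out subsets of the annotators.
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(1.0, matches / 3.0)

# Example: 4 of 10 annotators said "blue", so the prediction scores 100%.
print(100 * vqa_accuracy("blue", ["blue"] * 4 + ["navy"] * 6))
```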

gordonhu608 commented 9 months ago

The default prompt for VizWiz is "Question: {} Short answer: ". If this still doesn't give satisfying performance, then there could be other problems. Btw, another popular prompt, employed by LLaVA, is "When the provided information is insufficient, respond with ‘Unanswerable’. Answer the question using a single word or phrase". Although we never tried this, it could possibly lead to better performance.
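As an illustration, the two templates above could be applied to a VizWiz question as in the following minimal sketch; `build_prompt` is an illustrative helper, not code from the BLIVA repo.

```python
# Minimal sketch of applying the two prompt templates quoted above to a
# VizWiz question; build_prompt is an illustrative helper, not BLIVA code.
def build_prompt(question: str, style: str = "default") -> str:
    if style == "default":
        # BLIVA's default short-answer template for VizWiz.
        return "Question: {} Short answer:".format(question)
    if style == "llava":
        # LLaVA-style template mentioned above (untested by the authors).
        return (question + " When the provided information is insufficient, "
                "respond with 'Unanswerable'. "
                "Answer the question using a single word or phrase.")
    raise ValueError(f"unknown prompt style: {style}")

print(build_prompt("What does this label say?"))
# -> Question: What does this label say? Short answer:
```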

qwqwq1445 commented 9 months ago

> The default prompt for VizWiz is "Question: {} Short answer: ". If this still doesn't give satisfying performance, then there could be other problems. Btw, another popular prompt, employed by LLaVA, is "When the provided information is insufficient, respond with ‘Unanswerable’. Answer the question using a single word or phrase". Although we never tried this, it could possibly lead to better performance.

Thanks for your reply! Have you ever tried to fine-tune BLIVA on a single dataset? If so, should we use the prompt pool for that single dataset? And do you have any recommended hyperparameters for downstream fine-tuning?

gordonhu608 commented 9 months ago

No, we didn't fine-tune on any specific task. But some suggestions: 1) try both keeping the same prompt for all questions and using varied prompts, and compare which one is better; 2) the learning rate can usually start at 2e-5 or 1e-5, and again check which one suits your training setting.
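To make that concrete, a hypothetical starting configuration for a single-dataset fine-tune could look like the sketch below; none of these values come from the BLIVA repo, so treat every field as an assumption to sweep.

```python
# Hypothetical downstream fine-tuning hyperparameters, reflecting only the
# suggestions above (compare 2e-5 vs. 1e-5, and a fixed prompt vs. a prompt
# pool); all other values are placeholders to be tuned.
finetune_cfg = {
    "init_lr": 2e-5,          # also try 1e-5
    "min_lr": 1e-6,
    "warmup_steps": 1000,
    "weight_decay": 0.05,
    "batch_size": 16,
    "max_epochs": 5,
    # Option A: one fixed prompt for every question.
    "fixed_prompt": "Question: {} Short answer:",
    # Option B: sample from a small prompt pool; compare A vs. B on a
    # held-out split and keep whichever scores higher.
    "prompt_pool": [
        "Question: {} Short answer:",
        "{} Answer the question using a single word or phrase.",
    ],
}
```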

qwqwq1445 commented 9 months ago

> No, we didn't fine-tune on any specific task. But some suggestions: 1) try both keeping the same prompt for all questions and using varied prompts, and compare which one is better; 2) the learning rate can usually start at 2e-5 or 1e-5, and again check which one suits your training setting.

It seems that fine-tuning BLIVA is similar to fine-tuning BLIP2, so maybe I can just use the hyperparameters provided by BLIP2? By the way, for zero-shot VQA inference the BLIP2 paper says they "set the length-penalty to -1", but your default length-penalty is 0. Do you have any experience with this?
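For context, `length_penalty` is a beam-search knob: negative values bias decoding toward shorter sequences (which is why BLIP2 uses -1 for short-answer VQA), while 0 leaves beam scores unnormalized. Below is a generic Hugging Face sketch of the two settings, not BLIVA's own evaluation code.

```python
from transformers import GenerationConfig

# length_penalty < 0 penalizes long beams, nudging the model toward terse
# VQA-style answers; length_penalty = 0 scores beams by raw log-probability.
# These configs are illustrative; pass one to model.generate(...,
# generation_config=cfg) for whichever model you are evaluating.
blip2_style_cfg = GenerationConfig(num_beams=5, length_penalty=-1.0, max_new_tokens=10)
neutral_cfg = GenerationConfig(num_beams=5, length_penalty=0.0, max_new_tokens=10)
```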

qwqwq1445 commented 9 months ago

There are two stages of pretraining in BLIVA, but I can't find many details about the first stage in your paper. Do you use the pretrained weights of BLIP2 or InstructBLIP as your stage-1 model weights?

gordonhu608 commented 9 months ago

We use the InstructBLIP weights as our stage-1 initialization.
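For anyone reproducing stage 1, a LAVIS-style sketch of pulling the InstructBLIP (Vicuna-7B) weights as a starting checkpoint is shown below; BLIVA's own training scripts may load the checkpoint differently, so treat this as an assumption rather than the repo's actual code.

```python
import torch
from lavis.models import load_model_and_preprocess

# Load InstructBLIP (Vicuna-7B) through LAVIS as a stage-1 starting point.
# This only mirrors the answer above conceptually; BLIVA's actual stage-1
# initialization may wire the checkpoint in differently.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_vicuna_instruct",
    model_type="vicuna7b",
    is_eval=False,
    device=device,
)
```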