haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] ScienceQA performance #237


wcy1122 commented 1 year ago

Question

Hello, I tried to run inference with your released 13B checkpoint on ScienceQA using the latest code. However, the accuracy is only around 40%, which is much lower than the reported 90.89%. Is there something wrong?

My convert command line

```
python3 -m llava.model.apply_delta \
    --base /path/to/llama-13b-hf \
    --target /path/to/LLaVA-13b-v0-science_qa \
    --delta /path/to/LLaVA-13b-delta-v0
```

My inference command line

```
python -m llava.eval.model_vqa_science \
    --model-name /path/to/LLaVA-13b-v0-science_qa \
    --question-file /path/to/scienceqa/llava_test_QCM-LEPA.json \
    --image-folder /path/to/scienceqa/images/test \
    --answers-file /path/to/llava-13b-sqa-release/results/test_llava-13b.jsonl \
    --answer-prompter \
    --conv-mode simple
```

```
python -m llava.eval.eval_science_qa \
    --base-dir /path/to/scienceqa \
    --result-file /path/to/llava-13b-sqa-release/results/test_llava-13b.jsonl \
    --output-file /path/to/llava-13b-sqa-release/results/test_llava-13b_output.json \
    --output-result /path/to/llava-13b-sqa-release/results/test_llava-13b_result.json
```
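As a quick sanity check before running the eval script, the answers file can be previewed directly (a minimal sketch; it assumes each JSONL record stores the model output under a `text` key, which may differ across LLaVA versions):

```python
# Preview the first few predictions so malformed outputs are easy to spot.
# Assumes each JSONL line carries the model output under the "text" key.
import json

ANSWERS = "/path/to/llava-13b-sqa-release/results/test_llava-13b.jsonl"

with open(ANSWERS) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} predictions loaded")
for rec in records[:5]:
    # Truncated repr makes stray prefixes and escape characters visible.
    print(repr(rec.get("text", ""))[:120])
```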

haotian-liu commented 1 year ago

Hi @wcy1122

Another user found that re-downloading the correct checkpoints resolved a similar issue in #104.

Can you make sure that: (1) you downloaded the correct ScienceQA delta; (2) you applied the delta weights to obtain the correct model weights; (3) the base weights used in the conversion in step (2) are LLaMA, not Vicuna.
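For (1), one quick check is to hash the downloaded delta shards and compare the digests against the SHA-256 checksums shown for the LFS files on the Hub page (a minimal sketch; the directory path and the `*.bin` shard naming are assumptions):

```python
# Hash each downloaded shard so the digests can be compared with the
# SHA-256 values listed for the LFS files on the Hugging Face page.
import hashlib
from pathlib import Path

DELTA_DIR = Path("/path/to/LLaVA-13b-delta-v0-science_qa")  # placeholder path

for shard in sorted(DELTA_DIR.glob("*.bin")):
    h = hashlib.sha256()
    with open(shard, "rb") as f:
        # Read in 1 MiB chunks to keep memory usage flat on large shards.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print(shard.name, h.hexdigest())
```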

Thanks.

wcy1122 commented 1 year ago

Hi @haotian-liu, thanks for your reply. This is strange. I used LLaMA-13B from https://huggingface.co/decapoda-research/llama-13b-hf and downloaded the delta weights from https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0-science_qa.

wcy1122 commented 1 year ago

Hi. I found that in your released result file here, almost all outputs start with "Assistant:". But when I run inference with your released checkpoint, only about half of the outputs start with "Assistant:"; in most cases the model directly outputs "\n The answer is A.". I guess something is wrong with the inference prompt?
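This can be quantified over the answers file with a small script (a sketch; it assumes the model output is stored under a `text` key in each JSONL record):

```python
# Count how many predictions start with "Assistant:", to compare a local
# run against the released result file. Assumes the output is under "text".
import json

RESULT = "/path/to/llava-13b-sqa-release/results/test_llava-13b.jsonl"

with open(RESULT) as f:
    outputs = [json.loads(line)["text"] for line in f]

prefixed = sum(1 for t in outputs if t.lstrip().startswith("Assistant:"))
print(f"{prefixed}/{len(outputs)} outputs start with 'Assistant:'")
```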

CupidJay commented 1 year ago

Hi, I ran into the same problem. How did you solve it?