haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Why does ScienceQA evaluation return different scores across runs, even when the ScienceQA train dataset was used for training? #199

Open mary-0830 opened 1 year ago

mary-0830 commented 1 year ago

Question

Hello, I have two questions:

1. I used the same jsonl results, but the evaluated scores were different. Each run printed the same TensorFlow startup messages (oneDNN custom operations enabled, a note that the binary is optimized for available CPU instructions, and the TF-TRT warning "Could not find TensorRT"), then reported:

Total: 4241, Correct: 1776, Accuracy: 41.88%
Total: 4241, Correct: 1732, Accuracy: 40.84%
Total: 4241, Correct: 1690, Accuracy: 39.85%
Total: 4241, Correct: 1716, Accuracy: 40.46%

2. Why, after adding ScienceQA to the training data for finetuning, did the evaluation score still not reach 90%?

Looking forward to your prompt reply, thank you~
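(Editor's note: if the same jsonl file is truly re-scored, rather than the answers being regenerated, accuracy computation should be deterministic. A minimal sketch to verify this, using hypothetical field names — a predicted option letter in "text" keyed by "question_id", and a ground-truth "answer" index per question — and hypothetical file paths:)

```python
import json

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def score(pred_path, gt_path):
    # Hypothetical layout: each prediction carries the chosen option letter
    # in "text"; the ground-truth file maps question_id to the correct
    # option index.
    preds = {p["question_id"]: p["text"].strip() for p in load_jsonl(pred_path)}
    gt = {g["question_id"]: "ABCDE"[g["answer"]] for g in load_jsonl(gt_path)}
    correct = sum(preds.get(qid) == ans for qid, ans in gt.items())
    print(f"Total: {len(gt)}, Correct: {correct}, Accuracy: {correct / len(gt):.2%}")

# Scoring the same two files twice must print identical numbers; if the
# numbers move between runs, the answers are being regenerated (e.g. with
# sampling enabled) rather than re-scored.
score("predictions.jsonl", "scienceqa_test_gt.jsonl")  # hypothetical paths
```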

haotian-liu commented 1 year ago

Hi @mary-0830

Another user found that re-downloading the correct checkpoints resolved a similar issue in #104.

Can you make sure that: (1) you downloaded the correct ScienceQA delta; (2) you applied the delta weights to obtain the correct model weights; (3) the base model weights used in the conversion in step (2) are LLaMA, not Vicuna.

Thanks.
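(Editor's note: conceptually, applying a delta means adding the released delta tensors to the base LLaMA state dict. A simplified sketch of that idea — the paths are placeholders and this is not the repo's own apply-delta script, which additionally handles resized embeddings for new tokens:)

```python
import torch
from transformers import AutoModelForCausalLM

def apply_delta(base_path, delta_path, target_path):
    # Load the original LLaMA weights and the released delta weights.
    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    delta = AutoModelForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16)

    base_sd = base.state_dict()
    for name, param in delta.state_dict().items():
        if name in base_sd and base_sd[name].shape == param.shape:
            # target = base + delta for every matching tensor
            param.data += base_sd[name]
        # Tensors absent from the base (e.g. embeddings resized for added
        # tokens) are kept as-is from the delta in this simplified sketch.

    delta.save_pretrained(target_path)

# Placeholder paths: the base must be the original LLaMA weights, not Vicuna.
apply_delta("/path/to/llama-13b",
            "/path/to/llava-13b-delta-science_qa",
            "/path/to/llava-13b-science_qa")
```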

mary-0830 commented 1 year ago


Thank you for your reply! Perhaps I didn't make it clear: both of these issues concern our own fine-tuned model. First, when we evaluate our own fine-tuned model, we get varying scores as shown above. Second, after adding the ScienceQA training set, the trained model's accuracy does not seem to reach 90%.

haotian-liu commented 1 year ago

Hi @mary-0830

As stated in the paper, the results (90.9%) are obtained after (1) feature alignment in the first stage and (2) finetuning on the ScienceQA dataset.

Would you share more details about your training steps, and in which stage you obtained the accuracy mentioned above? Also, what is your accuracy "after adding the training set of Sciqa"? Thanks.

mary-0830 commented 1 year ago


Thank you for your reply. I have another question: why are the results in test_sqa_llava_13b_v0.json the same as the answers in llava_test_QCM-LEPA.json? Is it because the model reproduces them exactly?

I'm not very familiar with the training details, but I know that I should have included all the training set data from ScienceQA.
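(Editor's note: a quick way to check how similar the two files really are is to compare them entry by entry rather than by eye. A sketch, assuming a hypothetical format — each file is a JSON list of records with "question_id" and a free-form "text" answer:)

```python
import json

def load_answers(path):
    """Map question_id -> answer text, assuming a JSON list of records."""
    with open(path) as f:
        records = json.load(f)
    return {r["question_id"]: r["text"].strip() for r in records}

pred = load_answers("test_sqa_llava_13b_v0.json")
gt = load_answers("llava_test_QCM-LEPA.json")

same = sum(pred.get(qid) == ans for qid, ans in gt.items())
print(f"identical answers: {same}/{len(gt)}")

# Show a few questions where the texts diverge -- if the model were copying
# the ground truth verbatim, this list would be empty.
for qid in [q for q, a in gt.items() if pred.get(q) != a][:5]:
    print(qid, "\nGT:  ", gt[qid], "\nPred:", pred.get(qid), "\n")
```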

haotian-liu commented 1 year ago

@mary-0830 They are not the same (the model still fails on some questions), but they can look very similar. The main reason is that ScienceQA has a "lecture" section that we use for CoT prediction, and the lecture is the same for questions of the same type. This means that when the model correctly identifies the knowledge a question requires, it reproduces the lecture it saw during training, which is also included in the answer template of the GT answer. You'll notice differences in the final answer and in the actual problem-solving part.
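(Editor's note: since the lecture portion is shared across questions of the same type, comparing only the extracted final answer is more informative than comparing full responses. A minimal sketch, assuming the response ends with a sentence of the form "The answer is X." — this template is an assumption, not necessarily the exact format the eval script uses:)

```python
import re

# Assumed answer template at the end of the CoT response,
# e.g. "... The answer is B."
ANSWER_PATTERN = re.compile(r"The answer is ([A-E])\.?")

def extract_choice(response: str) -> str | None:
    """Pull the multiple-choice letter out of a lecture+reasoning response."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None

print(extract_choice(
    "A mineral is a pure substance formed in nature... The answer is B."
))  # -> "B"
```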