taokz / BiomedGPT

BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks
Apache License 2.0

About the accuracy on VQA-RAD #6

Closed luzimu closed 8 months ago

luzimu commented 10 months ago

Thanks for your marvelous work! I encountered some issues while evaluating the model on VQA-RAD. The checkpoint was downloaded from https://www.dropbox.com/s/sbhohrc0vsxio8r/vqa_rad.pt?dl=0, and we executed the 'evaluate_vqa_rad_beam.sh' script as instructed. However, the corresponding log shows an accuracy of 0.3038 on VQA-RAD, which is lower than the 71.6 reported in the paper. Could you please provide some clarification regarding this matter? Here is the preserved log:

2023-10-18 16:50:16 | INFO | ofa.evaluate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'simple', 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 7, 'cpu': False, 'tpu': False, 'bf16':(...)
2023-10-18 16:50:16 | INFO | ofa.evaluate | loading model(s) from ../../checkpoints/vqa_rad.pt
2023-10-18 16:50:20 | INFO | tasks.ofa_task | source dictionary: 59457 types
2023-10-18 16:50:20 | INFO | tasks.ofa_task | target dictionary: 59457 types
local datafile ../../datasets/finetuning/VQA-RAD/test.tsv slice_id 0 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../datasets/finetuning/VQA-RAD/test.tsv slice_id 0 finished initializing row_count and line_idx-to-offset mapping
file ../../datasets/finetuning/VQA-RAD/test.tsv slice_id 0 row count 451 total row count 451
2023-10-18 16:51:20 | INFO | ofa.evaluate | loading EMA weights from ../../checkpoints/vqa_rad.pt
2023-10-18 16:54:58 | INFO | ofa.evaluate | score_sum: tensor([137.], device='cuda:0'), score_cnt: tensor([451.], device='cuda:0'), score: 0.3038

taokz commented 10 months ago

Hi @luzimu

The issue arises from the fact that the generated answers are in lowercase, while the gold-standard answers are in uppercase. For example, BiomedGPT outputs 'yes,' whereas the gold answer is 'Yes.' Consequently, the scores reported in our logs may not accurately reflect the model's performance.

To address this, I have been using the 'test_predict.json' file generated during evaluation and calculating accuracy after converting all uppercase characters in the gold answers to lowercase, which yields the result reported in the paper.
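For reference, a minimal sketch of that post-hoc computation, assuming the predictions and gold answers have already been read into two parallel lists of strings (how they are extracted from 'test_predict.json' depends on the file's actual layout, which is not shown here):

```python
# Case-insensitive exact-match accuracy, a minimal sketch.
# Loading 'test_predict.json' into these two lists is left out;
# adapt the extraction step to the actual file layout.
def exact_match_accuracy(predictions, golds):
    """Exact match after lowercasing and stripping whitespace on both sides."""
    assert len(predictions) == len(golds)
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, golds))
    return hits / len(golds)

# e.g. the prediction 'yes' now matches the gold answer 'Yes'
print(exact_match_accuracy(["yes", "no"], ["Yes", "no"]))  # 1.0
```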

Thank you for bringing this to my attention; I had overlooked it previously. I will upload the relevant code and files later this week.

jtdutta1 commented 8 months ago

Hi, has this been fixed? I am facing the same issue. Also, I noticed that the ground truth in the dataset is already in lowercase, so I don't think casing is the cause of the lower accuracy. Moreover, the model hallucinates a lot, e.g., answering yes/no questions with entirely different answers. The accuracy tends to be close to the zero-shot accuracy reported in the paper, and adjusting the hyperparameters only gets me to 33.04%. Could it be that the actual fine-tuned model was mistakenly not uploaded?

taokz commented 8 months ago

@jtdutta1 I will double-check the checkpoint and the code and fix it ASAP.

taokz commented 8 months ago

@jtdutta1 @luzimu

Thank you for your patience. I confirmed that I uploaded the wrong checkpoint. Unfortunately, I cannot find the original one, but I have another checkpoint that achieves even better performance, as shown below. I've shared the weights via Dropbox.

2023-12-08 13:22:19 | INFO | ofa.evaluate | loading EMA weights from /data/biomedgpt/tuned_checkpoints/VQA-RAD/base/100_0.04_5e-5_384_/checkpoint_best.pt
/home/kaz321/OFA/data/mm_data/vqa_gen_dataset.py:64: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  decoder_prompts = np.array([s['decoder_prompt'].tolist() for s in samples])
2023-12-08 13:22:30 | INFO | ofa.evaluate | score_sum: tensor([332.], device='cuda:0'), score_cnt: tensor([451.], device='cuda:0'), score: 0.7361
2023-12-08 13:22:30 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
2023-12-08 13:22:30 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
/home/kaz321/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated
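As a quick sanity check, the final score in these logs is simply score_sum divided by score_cnt:

```python
# Reproducing the logged scores from the score_sum / score_cnt values above.
new_score = 332.0 / 451   # checkpoint shared in this comment
old_score = 137.0 / 451   # originally uploaded checkpoint
print(round(new_score, 4))  # 0.7361
print(round(old_score, 4))  # 0.3038
```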

If you have further questions, please let me know. Again, I appreciate your patience.

99Franklin commented 6 months ago

Hi, the download link has expired; could you please provide a new one?

taokz commented 5 months ago

@99Franklin Sorry, I only just noticed your comment; please use this link to download the checkpoint.

luzimu commented 5 months ago

Thank you for your detailed and patient reply. It completely resolved my issue.