microsoft / LLaVA-Med

Large Language-and-Vision Assistant for Biomedicine, built towards multimodal GPT-4 level capabilities.

Bug in evaluation code? #12

Open hellocym opened 8 months ago

hellocym commented 8 months ago

Great work on medical VQA!

https://github.com/microsoft/LLaVA-Med/blob/356ba559f471af61fc0e95873bdbbf40705dabc6/llava/eval/run_eval_pvqa.py#L86-L93

It seems that if a word such as 'normal' or 'note' appears in the model's generated answer while the ground truth is 'No', the script counts it as a correct answer because of the substring check. This would inflate the closed-set accuracy score.
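Since the linked snippet isn't reproduced here, this is only a minimal sketch of the suspected pattern and a word-boundary fix (function names are hypothetical, not from the repo):

```python
import re

def buggy_match(pred, gt):
    # Substring check: 'no' also matches inside 'normal', 'note', 'nothing', ...
    return gt.lower() in pred.lower()

def word_match(pred, gt):
    # Tokenize on word boundaries so 'no' only matches the standalone word
    tokens = re.findall(r"[a-z']+", pred.lower())
    return gt.lower() in tokens

print(buggy_match("The scan looks normal.", "no"))   # True  (false positive)
print(word_match("The scan looks normal.", "no"))    # False
print(word_match("No, there is no lesion.", "no"))   # True
```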

M3Dade commented 4 months ago

`if 'yes' in pred_value or 'no' in pred_value:`

It seems that if both 'yes' and 'no' appear in the generated answer, the answer is always counted as correct, whichever label the ground truth is.
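A minimal illustration of this second failure mode, assuming correctness is then scored with the same substring check (the helper below is a hypothetical reconstruction, not the repo's code):

```python
def scored_correct(pred_value, gt_value):
    # Suspected scoring pattern: gate on presence of 'yes'/'no',
    # then substring-match the ground-truth label against the prediction
    pred = pred_value.lower()
    if 'yes' in pred or 'no' in pred:
        return gt_value.lower() in pred
    return False

ambiguous = "Yes and no: the findings are equivocal."
print(scored_correct(ambiguous, "yes"))  # True
print(scored_correct(ambiguous, "no"))   # True -> scored correct against either label
```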