microsoft / LLaVA-Med

Large Language-and-Vision Assistant for Biomedicine, built towards multimodal GPT-4 level capabilities.

Confusion about the accuracy of the model #28

Open WindMarx opened 10 months ago

WindMarx commented 10 months ago

I downloaded the delta weights for Slake and added them to the weights of Llama as instructed. I evaluated it on the Slake test set, and the accuracy on the closed set was only 44.95%. Is there an issue with the provided download weights?
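For reference, this is roughly how I compute that closed-set number, in case the metric itself is the problem. A minimal sketch only: the file names and JSON fields (`question_id`, `answer`, `text`) come from my own setup, not from the official `llava/eval/run_eval.py`.

```python
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def normalize(ans):
    # Lower-case and drop trailing punctuation so "Yes." matches "yes".
    return ans.strip().lower().rstrip(".")

# Hypothetical file names/fields from my own setup; adjust to yours.
gt = {r["question_id"]: r["answer"] for r in load_jsonl("slake_test_gt.jsonl")}
pred = {r["question_id"]: r["text"] for r in load_jsonl("slake_answers.jsonl")}

closed = [qid for qid, a in gt.items() if normalize(a) in ("yes", "no")]
hits = sum(normalize(pred.get(qid, "")) == normalize(gt[qid]) for qid in closed)
print(f"closed-set accuracy: {hits / len(closed):.2%} ({hits}/{len(closed)})")
```

If the official evaluation normalizes answers differently (for example substring matching instead of exact matching), that alone could shift the closed-set number considerably.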

hellocym commented 10 months ago

I encountered the same problem with the VQA-RAD model.

LIMANONDBannapol commented 10 months ago

I also face the same problem with SLAKE and VQA-RAD: I achieve around 64% accuracy on SLAKE and 35% on VQA-RAD. I would appreciate it if you could point out any issues with the weights; or, if you are still able to reproduce the reported accuracy, then perhaps the mistake is on my end.

LIMANONDBannapol commented 10 months ago

> I downloaded the delta weights for Slake and added them to the weights of Llama as instructed. I evaluated it on the Slake test set, and the accuracy on the closed set was only 44.95%. Is there an issue with the provided download weights?

Hi @WindMarx! Were you able to reproduce the SLAKE accuracy reported in the paper (~85%)? I'd like to determine whether the issue lies on my end or with the provided weights. Best regards!

M3Dade commented 9 months ago

@LIMANONDBannapol Hello, could you please tell me how you evaluate it? When I use llava/eval/run_eval.py, it says a candidate.json file is needed, but I can't find it. Do you have any other evaluation scripts?

zhongzee commented 9 months ago

@LIMANONDBannapol I have the same issue as @M3Dade: when I use llava/eval/run_eval.py, it says a candidate.json file is needed, but I can't find it. Do you have any other evaluation scripts?
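In the meantime, as a workaround I have been scoring the open questions myself with plain token recall against the ground-truth answer. This is only a rough sketch under my own assumptions (JSONL files with `question_id`/`answer`/`text` fields, recall as the open-set metric), not necessarily the exact metric used in the paper or in run_eval.py:

```python
import json, string

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def tokens(s):
    # Lower-case, strip punctuation, split on whitespace.
    return set(s.lower().translate(str.maketrans("", "", string.punctuation)).split())

def open_set_recall(gt_file, pred_file):
    gt = {r["question_id"]: r["answer"] for r in load_jsonl(gt_file)}
    pred = {r["question_id"]: r["text"] for r in load_jsonl(pred_file)}
    scores = []
    for qid, ans in gt.items():
        if ans.strip().lower().rstrip(".") in ("yes", "no"):
            continue  # closed question: scored separately with exact match
        ref = tokens(ans)
        scores.append(len(ref & tokens(pred.get(qid, ""))) / max(len(ref), 1))
    return sum(scores) / max(len(scores), 1)

# Hypothetical file names; adjust to your ground truth and prediction files.
print(f"open-set token recall: {open_set_recall('test_gt.jsonl', 'answers.jsonl'):.2%}")
```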

jiongdemieshi commented 9 months ago

My test gives a similar result, with only 62% accuracy on SLAKE.

jiongdemieshi commented 9 months ago

In addition, I fine-tuned llava_med_in_text_60k_ckpt2_delta with LLaVA-format instructions and got only about 40% accuracy. Even though I explicitly added a constraining instruction to the prompt, the model still answers with full sentences. With the same instruction fine-tuning on LLaVA (not LLaVA-Med), I can reach about 80% accuracy.
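To check whether the verbose answers (rather than the weights) are what is killing the closed-set score, I run the predictions through a small normalizer before matching. This is my own heuristic, not anything from the repo:

```python
import re

def to_yes_no(answer: str) -> str:
    """Heuristically map a free-form answer to 'yes'/'no'; return '' if unclear."""
    text = answer.strip().lower()
    first = re.split(r"[\s,.]+", text, maxsplit=1)[0]
    if first in ("yes", "no"):
        return first
    if re.search(r"\b(no|not|without|absent)\b", text):
        return "no"
    if re.search(r"\b(yes|present|there is|there are)\b", text):
        return "yes"
    return ""

print(to_yes_no("Yes, there is a lesion in the left lung."))  # -> yes
print(to_yes_no("The mass does not appear enlarged."))        # -> no
```

With this kind of mapping applied before exact matching, you can at least tell how much of the gap is answer formatting versus the model actually being wrong.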

dcampanini commented 8 months ago

I have the same problem with closed questions; maybe they used a different prompt in this case. I tried different prompts and reached a yes/no accuracy of 60%.

zhongzee commented 8 months ago

@dcampanini Hello, may I ask at which stage you used different prompts? Instruction fine-tuning?

dcampanini commented 8 months ago

I tried different prompts during inference when I was replicating the results. You can select different prompts when you run llava/eval/model_vqa_med.py: the --conv-mode parameter controls the prompt, and you can find the different options in llava/conversation.py.
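If it helps, you can list the template names that --conv-mode accepts directly from the repo. This assumes the conversation module exposes a `conv_templates` dict the way upstream LLaVA does; check llava/conversation.py in your checkout if the names differ:

```python
# Run from the LLaVA-Med repo root with its environment activated.
from llava.conversation import conv_templates

# Each key should be a valid value for --conv-mode in llava/eval/model_vqa_med.py.
for name, template in conv_templates.items():
    print(name, "->", template.system[:60])
```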

JennDong commented 7 months ago

@dcampanini Hello, may I ask if you are able to reproduce a result close to the paper?

dcampanini commented 6 months ago

@JennDong I was able to get the same results for LLaVA-Med fine-tuned on the VQA-RAD dataset, but only for open questions. I used the shared model "LLaVA-Med VQA-RAD-finetuned".

hjchen96 commented 5 months ago

I've encountered the same issue. Whether I use the provided checkpoint or the one I fine-tuned myself, I can't reproduce the results from the paper; I'm achieving approximately 65% accuracy on VQA-RAD. Can anyone tell me if their results can be reproduced?

believewhat commented 3 months ago

We warmly welcome everyone to test our model, LLaVa3-Med, a multimodal medical model based on Llama3.

Github: https://github.com/believewhat/LLaVa3-Med

Huggingface: https://huggingface.co/akemiH/LLaVa3-Med