Open WindMarx opened 10 months ago
I encountered the same problem on the VQA-RAD model.
I also face the same problem with SLAKE and VQA-RAD; I achieve around 64% accuracy on SLAKE and 35% on VQA-RAD. I would appreciate it if you could point out any issues with the weights, or, if you are still able to reproduce the accuracy, then perhaps the mistake is on my end.
I downloaded the delta weights for Slake and added them to the weights of Llama as instructed. I evaluated it on the Slake test set, and the accuracy on the closed set was only 44.95%. Is there an issue with the provided download weights?
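For reference, merging delta weights follows the usual pattern of adding the delta checkpoint to the base LLaMA weights, key by key. A minimal sketch of that idea (plain Python lists stand in for real torch tensors, and `apply_delta` is a hypothetical helper, not the repo's actual script):

```python
# Minimal sketch of delta-weight merging: merged = base + delta, key by key.
# Plain Python lists stand in for torch tensors; the real script operates on
# full model state dicts loaded from checkpoint files.

def apply_delta(base, delta):
    """Add delta weights to base weights, key by key (hypothetical helper)."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta checkpoints must share the same keys")
    return {
        name: [b + d for b, d in zip(base[name], delta[name])]
        for name in base
    }

base = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0, 0.5]}
delta = {"layer.weight": [0.05, -0.1], "layer.bias": [0.2, 0.0]}
merged = apply_delta(base, delta)
```

If the merge went wrong (wrong base LLaMA version, mismatched keys), accuracy would be expected to collapse much further than 44.95%, which is why a prompt/evaluation mismatch seems more likely.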
Hi! @WindMarx Could you reproduce the accuracy of SLAKE as reported in the paper? (~85%) I'd like to determine whether the issue lies on my end or with the provided weights. Best regards!
@LIMANONDBannapol Hello, can you please tell me how you evaluate it? When I use llava/eval/run_eval.py, it tells me that candidate.json is needed, but I can't find that file. Do you have any other evaluation scripts?
@LIMANONDBannapol I also have this issue. @M3Dade: Hello, can you please tell me how you evaluate it? When I use llava/eval/run_eval.py, it tells me that candidate.json is needed, but I can't find that file. Do you have any other evaluation scripts?
My test gives a similar result, with only 62% accuracy on SLAKE.
In addition, I fine-tuned with the LLaVA-format instructions on llava_med_in_text_60k_ckpt2_delta and got only 40% accuracy. Even after adding a constraining instruction, it still answers in full sentences. But fine-tuning with the same instructions on plain LLaVA (not LLaVA-Med) reaches 80% accuracy.
I have the same problem with closed questions; maybe they used a different prompt in this case. I tried different prompts and reached a yes/no accuracy of 60%.
@dcampanini Hello, may I ask at which stage you used different prompts? Instruction fine-tuning?
I tried different prompts during inference when I was replicating the results. You can select a prompt when you run llava/eval/model_vqa_med.py: the parameter --conv-mode controls the prompt template, and you can find the available options in llava/conversation.py.
@dcampanini Hello, may I ask if you were able to reproduce results close to the paper?
@JennDong I was able to match the paper's results for llava-med fine-tuned on VQA-RAD, but only for open questions. I used the shared model "LLaVA-Med VQA-RAD-finetuned".
I've encountered the same issue. Whether I use the checkpoint provided by the authors or the one I fine-tuned myself, I can't reproduce the results from the paper; I'm achieving approximately 65% accuracy on VQA-RAD. Can anyone tell me if their results can be reproduced?
We warmly welcome everyone to test our model, LLaVa3-Med, a multimodal medical model based on Llama3.
Github: https://github.com/believewhat/LLaVa3-Med
Huggingface: https://huggingface.co/akemiH/LLaVa3-Med