kbressem / medAlpaca

LLM finetuned for medical question answering
GNU General Public License v3.0

Getting different scores for medalpaca 7B and medalpaca 13b #47

Open anand-subu opened 1 year ago

anand-subu commented 1 year ago

Hi there!

Great work with medalpaca! I was trying to reproduce your scores on the USMLE eval sets for medalpaca 7B and medalpaca 13B. However, when I run the notebook shared in #40, I'm getting the following scores:

[screenshot: reproduced scores]

To double-check, I also calculated the scores directly myself, ignoring the questions with images, and I get the same scores as the notebook.
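For reference, the direct calculation looks roughly like this (a sketch only, not the exact code I ran; the "image" key and the file path are assumptions about the dataset layout):

```python
import json

# Sketch only: drop questions flagged with an image before computing accuracy.
# The "image" key and the file name below are assumptions about the dataset layout.
with open("medical_meadow_usmle_self_assessment/step1.json") as f:
    questions = json.load(f)

text_only = [q for q in questions if not q.get("image")]
print(f"Scoring {len(text_only)} of {len(questions)} questions (image questions skipped)")
```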

I ran the eval code as follows:

python eval_usmle.py --model_name medalpaca/medalpaca-7b --prompt_template ../medalpaca/prompt_templates/medalpaca.json --base_model False --peft False --load_in_8bit False --path_to_exams medical_meadow_usmle_self_assessment
python eval_usmle.py --model_name medalpaca/medalpaca-13b --prompt_template ../medalpaca/prompt_templates/medalpaca.json --base_model False --peft False --load_in_8bit False --path_to_exams medical_meadow_usmle_self_assessment

But I'm still seeing considerable differences. medalpaca-7B is quite close to its reported scores in the GitHub README, but medalpaca-13B is not. Could you let me know if I might be doing something wrong on my side?

Thank you!

samuelvkwong commented 8 months ago

I am also having trouble reproducing the scores on the USMLE eval sets for medalpaca-13b, following the same steps as noted above.

I had to make changes to the notebook used to compute the scores, because there is a discrepancy between how the notebook assumes the answers in the generated JSON files should look and what they actually look like. In the evaluation script, if the response is not in the correct format, the LM is prompted a maximum of 5 times, and only the final response is saved as the answer (the previous responses are not saved): https://github.com/kbressem/medAlpaca/blob/63448c57967359ee04e6408ae418418ba0ac9f3a/eval/eval_usmle.py#L169. The scoring notebook, however, assumes that all responses up to the final answer are saved as "answer_1", "answer_2", etc., which does not match the setup in the evaluation script, where only "answer" contains the final answer.

Here is the adjusted notebook: eval_usmle_edited.ipynb.zip. Here are the scores:

[screenshot: scores from the adjusted notebook, 2024-02-15]
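The core of the adjustment is just scoring the single "answer" field that eval_usmle.py writes, roughly like this (the gold-label key name is an assumption about the generated JSON):

```python
# Rough sketch of the adjustment: score only the single "answer" field that
# eval_usmle.py actually writes, instead of iterating over "answer_1" ... "answer_5",
# which never exist in the generated files. The gold-label key ("answer_idx")
# is an assumption about the layout of the generated JSON.
import json
import re

def exact_match(path: str) -> float:
    with open(path) as f:
        records = json.load(f)

    hits, total = 0, 0
    for rec in records:
        raw = str(rec.get("answer", ""))       # only the final response is stored
        m = re.match(r"\s*\(?([A-E])\b", raw)  # expect a leading letter option
        pred = m.group(1) if m else None
        hits += int(pred is not None and pred == rec.get("answer_idx"))
        total += 1
    return hits / total if total else 0.0
```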

When I took a look at the generated answers in the JSON files, I noticed that a lot of them were unintelligible: the answer did not start with a letter option, or multiple letter options were given. So I prepared my own evaluation script aiming to fix what I thought were formatting errors. In my script I run the medalpaca-13b model in a HuggingFace pipeline and use few-shot prompting to make it more likely that the answer comes back in the correct format. To further ensure the answer is in the correct format, I also pass the answer along with the list of options to an OpenAI LM instance and ask it to select the option closest to the provided answer.

With my script, the EM scores for Step 1, Step 2, and Step 3 are 0.250, 0.257, and 0.290 respectively. There is improvement (most likely from solving the formatting issues), but the results are still quite different from the reported scores for medalpaca-13b.
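The answer-normalization step looks roughly like this (the OpenAI model name and prompt wording here are illustrative, not necessarily what I used):

```python
# Rough sketch of the normalization step: map a free-form generation onto the
# closest answer option letter. The OpenAI model name and prompt wording are
# illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def normalize_answer(raw_answer: str, options: dict) -> str:
    """Ask an LM to pick the option letter closest to the model's raw answer."""
    options_text = "\n".join(f"{letter}: {text}" for letter, text in options.items())
    prompt = (
        "Given the candidate answer below, reply with only the letter of the "
        "option that is closest to it.\n\n"
        f"Options:\n{options_text}\n\nCandidate answer: {raw_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()[:1]
```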

From what I read in your paper, the reported scores were achieved zero-shot and without additional prompting techniques. Could you let me know if there is something I am missing, or if you've been able to reproduce the scores recently?

jzy-dyania commented 5 months ago

I tried inference with the HuggingFace pipeline, which gave higher results, but still 4–7% lower than the reported USMLE scores. Does anyone have an idea about the discrepancy between the two methods?
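For context, this is a minimal sketch of the kind of pipeline inference I mean; the prompt wording and generation settings are illustrative and likely differ from what eval_usmle.py does, which may itself account for part of the gap:

```python
# Minimal sketch of "inference with the HuggingFace pipeline" for medalpaca-13b.
# Prompt wording and generation settings are illustrative assumptions.
import torch
from transformers import pipeline

pl = pipeline(
    "text-generation",
    model="medalpaca/medalpaca-13b",
    tokenizer="medalpaca/medalpaca-13b",
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = (
    "Below is an instruction that describes a task, paired with an input. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nAnswer this multiple-choice question with the letter of "
    "the correct option.\n\n### Input:\n<question and options here>\n\n### Response:\n"
)
output = pl(prompt, max_new_tokens=16, do_sample=False)
print(output[0]["generated_text"][len(prompt):].strip())
```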