[Question] Impossible to reproduce results, model performs poorly

llSourcell / Doctor-Dignity

Doctor Dignity is an LLM that can pass the US Medical Licensing Exam. It works offline, it's cross-platform, & your health data stays private.

Apache License 2.0

3.84k stars, 403 forks

[Question] Impossible to reproduce results, model performs poorly #32

Open maximegmd opened 1 year ago

maximegmd commented 1 year ago

❓ General Questions

I evaluated the model using lm-evaluation-harness on MedMCQA, MedQA-USMLE, and PubMedQA. It performs barely above Llama 2 7B: only 38% on the USMLE, 36% on MedMCQA, and 73.9% on PubMedQA.

Could you describe how you got your results?

s1ghhh commented 1 year ago

emmmm, me too.

llSourcell commented 1 year ago

Hey Doc! The evaluation function I used is in the .ipynb notebook included in the repository. I set a semantic similarity threshold, so any response close enough to one of the possible USMLE answers counts as correct. A response doesn't have to be verbatim, which is why the accuracy came out higher. Also, I'm about to release a new fine-tuned model next week. The goal here is to keep improving. I just merged my first PR and posted a paid bounty last week for UI issues. Would love your help!
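To make the difference between the two scoring approaches concrete, here is a minimal sketch, not the notebook's actual code. It assumes a bag-of-words cosine similarity as a stand-in for whatever embedding model the notebook uses. The function names, the 0.4 threshold, and the example answers are all hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts (assumed stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_match(response, reference):
    """Strict scoring: the response must equal the reference (ignoring case and outer whitespace)."""
    return response.strip().lower() == reference.strip().lower()


def similarity_match(response, reference, threshold=0.4):
    """Lenient scoring: correct if similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(bow_vector(response), bow_vector(reference)) >= threshold


def accuracy(pairs, scorer):
    """Fraction of (response, reference) pairs the scorer marks as correct."""
    return sum(scorer(resp, ref) for resp, ref in pairs) / len(pairs)


# Made-up (response, reference) pairs, for illustration only.
pairs = [
    ("Metformin", "Metformin"),
    ("The best first step is to start metformin", "Start metformin"),
    ("Order a chest X-ray", "Start metformin"),
]

strict_acc = accuracy(pairs, exact_match)
lenient_acc = accuracy(pairs, similarity_match)
print(f"exact match: {strict_acc:.2f}, similarity threshold: {lenient_acc:.2f}")
```

On the same responses, the threshold scorer accepts the paraphrased second answer that exact matching rejects, so its accuracy is higher. This is one reason scores from this kind of scorer are not directly comparable to multiple-choice accuracy from lm-evaluation-harness. In this sketch a lexical similarity also accepts "Do not start metformin" against "Start metformin", so the choice of similarity measure and threshold matters a great deal.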