Open vlievin opened 2 years ago
Nope, but you can use a score of 60 out of 100 as a reference, which is the passing score for the Med Exam.
Thank you for the info! So, just to make sure we are aligned here: does that mean 60% answering accuracy for the US, TW and MC datasets?
Yep, 60% accuracy can be considered the human passing score. For the US dataset, this score is quite hard to reach.
Can you elaborate on your reasoning here please? It seems a somewhat flawed heuristic to assume human performance on this particular QA dataset [which to my understanding is generated?] would be approximately equivalent to the outcome of humans being tested in actual exams.
Hi! Is there a known human baseline for this dataset (open and closed book)? Or maybe the required score to pass the exam?