simonw / llm-evals-plugin

Run evals using LLM

Run a subset of MMLU #8

Open simonw opened 2 months ago

simonw commented 2 months ago

Running a subset of MMLU would be a great proof of concept for this tool.

The repo is https://github.com/hendrycks/test, but the data itself is a 158M TAR that you have to download from https://people.eecs.berkeley.edu/~hendrycks/data.tar

simonw commented 2 months ago

The content of that TAR:

data
data/possibly_contaminated_urls.txt
data/test
data/test/moral_scenarios_test.csv
data/test/us_foreign_policy_test.csv
data/test/public_relations_test.csv
data/test/global_facts_test.csv
data/test/electrical_engineering_test.csv
data/test/astronomy_test.csv
data/test/business_ethics_test.csv
data/test/jurisprudence_test.csv
data/test/high_school_chemistry_test.csv
data/test/college_physics_test.csv
data/test/professional_psychology_test.csv
data/test/marketing_test.csv
data/test/management_test.csv
data/test/virology_test.csv
data/test/international_law_test.csv
data/test/high_school_macroeconomics_test.csv
data/test/prehistory_test.csv
data/test/abstract_algebra_test.csv
data/test/high_school_physics_test.csv
data/test/formal_logic_test.csv
data/test/college_medicine_test.csv
data/test/high_school_us_history_test.csv
data/test/moral_disputes_test.csv
data/test/high_school_european_history_test.csv
data/test/clinical_knowledge_test.csv
data/test/world_religions_test.csv
data/test/high_school_microeconomics_test.csv
data/test/professional_law_test.csv
data/test/human_aging_test.csv
data/test/medical_genetics_test.csv
data/test/high_school_geography_test.csv
data/test/high_school_government_and_politics_test.csv
data/test/anatomy_test.csv
data/test/sociology_test.csv
data/test/logical_fallacies_test.csv
data/test/high_school_computer_science_test.csv
data/test/miscellaneous_test.csv
data/test/high_school_world_history_test.csv
data/test/professional_medicine_test.csv
data/test/high_school_biology_test.csv
data/test/high_school_statistics_test.csv
data/test/college_chemistry_test.csv
data/test/nutrition_test.csv
data/test/econometrics_test.csv
data/test/human_sexuality_test.csv
data/test/security_studies_test.csv
data/test/philosophy_test.csv
data/test/elementary_mathematics_test.csv
data/test/college_biology_test.csv
data/test/college_computer_science_test.csv
data/test/machine_learning_test.csv
data/test/professional_accounting_test.csv
data/test/college_mathematics_test.csv
data/test/high_school_mathematics_test.csv
data/test/high_school_psychology_test.csv
data/test/conceptual_physics_test.csv
data/test/computer_security_test.csv
data/auxiliary_train
data/auxiliary_train/obqa.csv
data/auxiliary_train/science_elementary.csv
data/auxiliary_train/arc_easy.csv
data/auxiliary_train/aux_law_90s.csv
data/auxiliary_train/mc_test.csv
data/auxiliary_train/race.csv
data/auxiliary_train/science_middle.csv
data/auxiliary_train/arc_hard.csv
data/dev
data/dev/prehistory_dev.csv
data/dev/formal_logic_dev.csv
data/dev/conceptual_physics_dev.csv
data/dev/moral_scenarios_dev.csv
data/dev/high_school_macroeconomics_dev.csv
data/dev/clinical_knowledge_dev.csv
data/dev/electrical_engineering_dev.csv
data/dev/high_school_us_history_dev.csv
data/dev/computer_security_dev.csv
data/dev/international_law_dev.csv
data/dev/logical_fallacies_dev.csv
data/dev/business_ethics_dev.csv
data/dev/high_school_psychology_dev.csv
data/dev/professional_accounting_dev.csv
data/dev/management_dev.csv
data/dev/medical_genetics_dev.csv
data/dev/world_religions_dev.csv
data/dev/high_school_chemistry_dev.csv
data/dev/high_school_government_and_politics_dev.csv
data/dev/high_school_computer_science_dev.csv
data/dev/high_school_microeconomics_dev.csv
data/dev/econometrics_dev.csv
data/dev/high_school_world_history_dev.csv
data/dev/nutrition_dev.csv
data/dev/us_foreign_policy_dev.csv
data/dev/global_facts_dev.csv
data/dev/human_aging_dev.csv
data/dev/anatomy_dev.csv
data/dev/abstract_algebra_dev.csv
data/dev/astronomy_dev.csv
data/dev/public_relations_dev.csv
data/dev/human_sexuality_dev.csv
data/dev/high_school_biology_dev.csv
data/dev/college_computer_science_dev.csv
data/dev/high_school_physics_dev.csv
data/dev/college_mathematics_dev.csv
data/dev/high_school_mathematics_dev.csv
data/dev/professional_law_dev.csv
data/dev/high_school_statistics_dev.csv
data/dev/miscellaneous_dev.csv
data/dev/college_medicine_dev.csv
data/dev/professional_psychology_dev.csv
data/dev/college_biology_dev.csv
data/dev/college_physics_dev.csv
data/dev/elementary_mathematics_dev.csv
data/dev/moral_disputes_dev.csv
data/dev/philosophy_dev.csv
data/dev/high_school_geography_dev.csv
data/dev/marketing_dev.csv
data/dev/virology_dev.csv
data/dev/jurisprudence_dev.csv
data/dev/sociology_dev.csv
data/dev/college_chemistry_dev.csv
data/dev/professional_medicine_dev.csv
data/dev/high_school_european_history_dev.csv
data/dev/security_studies_dev.csv
data/dev/machine_learning_dev.csv
data/README.txt
data/val
data/val/security_studies_val.csv
data/val/machine_learning_val.csv
data/val/college_chemistry_val.csv
data/val/professional_medicine_val.csv
data/val/high_school_european_history_val.csv
data/val/jurisprudence_val.csv
data/val/virology_val.csv
data/val/sociology_val.csv
data/val/college_physics_val.csv
data/val/college_biology_val.csv
data/val/philosophy_val.csv
data/val/high_school_geography_val.csv
data/val/moral_disputes_val.csv
data/val/elementary_mathematics_val.csv
data/val/marketing_val.csv
data/val/college_medicine_val.csv
data/val/professional_psychology_val.csv
data/val/professional_law_val.csv
data/val/high_school_statistics_val.csv
data/val/miscellaneous_val.csv
data/val/high_school_mathematics_val.csv
data/val/human_sexuality_val.csv
data/val/high_school_physics_val.csv
data/val/college_computer_science_val.csv
data/val/high_school_biology_val.csv
data/val/college_mathematics_val.csv
data/val/public_relations_val.csv
data/val/anatomy_val.csv
data/val/global_facts_val.csv
data/val/human_aging_val.csv
data/val/astronomy_val.csv
data/val/abstract_algebra_val.csv
data/val/high_school_microeconomics_val.csv
data/val/high_school_government_and_politics_val.csv
data/val/high_school_chemistry_val.csv
data/val/high_school_computer_science_val.csv
data/val/high_school_world_history_val.csv
data/val/econometrics_val.csv
data/val/nutrition_val.csv
data/val/us_foreign_policy_val.csv
data/val/medical_genetics_val.csv
data/val/world_religions_val.csv
data/val/computer_security_val.csv
data/val/international_law_val.csv
data/val/business_ethics_val.csv
data/val/logical_fallacies_val.csv
data/val/professional_accounting_val.csv
data/val/management_val.csv
data/val/high_school_psychology_val.csv
data/val/conceptual_physics_val.csv
data/val/moral_scenarios_val.csv
data/val/electrical_engineering_val.csv
data/val/clinical_knowledge_val.csv
data/val/high_school_macroeconomics_val.csv
data/val/high_school_us_history_val.csv
data/val/prehistory_val.csv
data/val/formal_logic_val.csv
simonw commented 2 months ago

The data/README.txt file says:

This file contains the dev, val, and test data for our multitask test.

The dev dataset is for few-shot learning to prime the model, and the test set the source of evaluation questions.

The auxiliary_training data could be used for fine-tuning, something important for models without few-shot capabilities. This auxiliary training data comes from other NLP multiple choice datasets such as MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), ARC (Clark et al., 2018, 2016), and OBQA (Mihaylov et al., 2018).

Unless otherwise specified, the questions are in reference to human knowledge as of January 1st, 2020. In the far future, it may be useful to add to the prompt that the question is written for 2020 audiences.

--

If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:

@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

simonw commented 2 months ago

Here are the contents of the data/dev folder, a set of small (~5 lines each) CSV files, extracted into a Gist: https://gist.github.com/simonw/5a275448270a36ba80e35972c3cd9c7e

The CSV files don't have headers, which is annoying. Each one looks like this:

What is the embryological origin of the hyoid bone?,The first pharyngeal arch,The first and second pharyngeal arches,The second pharyngeal arch,The second and third pharyngeal arches,D
Which of these branches of the trigeminal nerve contain somatic motor processes?,The supraorbital nerve,The infraorbital nerve,The mental nerve,None of the above,D
The pleura,have no sensory innervation.,are separated by a 2 mm space.,extend into the neck.,are composed of respiratory epithelium.,C
In Angle's Class II Div 2 occlusion there is,excess overbite of the upper lateral incisors.,negative overjet of the upper central incisors.,excess overjet of the upper lateral incisors.,excess overjet of the upper central incisors.,C
Which of the following is the body cavity that contains the pituitary gland?,Abdominal,Cranial,Pleural,Spinal,B

simonw commented 2 months ago

Here's that transformed into JSON with column names:

[
  {
    "Question": "What is the embryological origin of the hyoid bone?",
    "A": "The first pharyngeal arch",
    "B": "The first and second pharyngeal arches",
    "C": "The second pharyngeal arch",
    "D": "The second and third pharyngeal arches",
    "Answer": "D"
  },
  {
    "Question": "Which of these branches of the trigeminal nerve contain somatic motor processes?",
    "A": "The supraorbital nerve",
    "B": "The infraorbital nerve",
    "C": "The mental nerve",
    "D": "None of the above",
    "Answer": "D"
  },
  {
    "Question": "The pleura",
    "A": "have no sensory innervation.",
    "B": "are separated by a 2 mm space.",
    "C": "extend into the neck.",
    "D": "are composed of respiratory epithelium.",
    "Answer": "C"
  },
  {
    "Question": "In Angle's Class II Div 2 occlusion there is",
    "A": "excess overbite of the upper lateral incisors.",
    "B": "negative overjet of the upper central incisors.",
    "C": "excess overjet of the upper lateral incisors.",
    "D": "excess overjet of the upper central incisors.",
    "Answer": "C"
  },
  {
    "Question": "Which of the following is the body cavity that contains the pituitary gland?",
    "A": "Abdominal",
    "B": "Cranial",
    "C": "Pleural",
    "D": "Spinal",
    "Answer": "B"
  }
]
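
One way to do this kind of conversion is with Python's `csv.DictReader`, passing the column names explicitly because the files have no header row. This is a sketch under that assumption and may not be the tool that produced the JSON above. `COLUMNS` and `rows_with_headers` are hypothetical names, and the sample text is two rows copied from the anatomy file.

```python
import csv
import io
import json

# Column names matching the JSON above; the CSV files themselves have no header.
COLUMNS = ["Question", "A", "B", "C", "D", "Answer"]


def rows_with_headers(csv_text):
    """Parse headerless MMLU-style CSV text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    return list(reader)


sample = (
    "The pleura,have no sensory innervation.,are separated by a 2 mm space.,"
    "extend into the neck.,are composed of respiratory epithelium.,C\n"
    "Which of the following is the body cavity that contains the pituitary gland?,"
    "Abdominal,Cranial,Pleural,Spinal,B\n"
)

records = rows_with_headers(sample)
print(json.dumps(records, indent=2))
```

Supplying `fieldnames` stops `DictReader` from consuming the first question as a header row.
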
simonw commented 2 months ago

Loaded that into Datasette: https://simon.datasette.cloud/data/mmlu_dev_anatomy

And ran this enrichment as an experiment:

[Screenshot: CleanShot 2024-04-21 at 13 18 41@2x]

It got 4/5 right:

[Screenshot: CleanShot 2024-04-21 at 13 19 30@2x]

simonw commented 2 months ago

I'm going to run the full US high school history test through GPT-3.5: https://simon.datasette.cloud/data/mmlu_test_high_school_us_history_test

That's from data/test/high_school_us_history_test.csv, which has 204 rows (the count below says 203 because the file has no header row, so the first data row gets treated as column names):

sqlite-utils memory test/high_school_us_history_test.csv:csv 'select count(*) from t'
[{"count(*)": 203}]
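
The off-by-one can be reproduced with the Python `csv` module. This is a small illustration of the same effect, not what sqlite-utils does internally. `count_rows` is a hypothetical helper and the three sample rows are made up.

```python
import csv
import io


def count_rows(csv_text, first_row_is_header):
    """Count data rows, optionally treating the first row as a header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if first_row_is_header and rows:
        return len(rows) - 1
    return len(rows)


# Three headerless rows in the question, A, B, C, D, answer layout (made-up data).
sample = (
    "Question one?,w,x,y,z,A\n"
    "Question two?,w,x,y,z,B\n"
    "Question three?,w,x,y,z,C\n"
)

print(count_rows(sample, first_row_is_header=False))  # every row is data
print(count_rows(sample, first_row_is_header=True))   # first row lost to the header
```
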
simonw commented 2 months ago

[Screenshot: CleanShot 2024-04-21 at 13 30 24@2x]

simonw commented 2 months ago

Results: https://simon.datasette.site/data/mmlu_test_high_school_us_history_test

simonw commented 2 months ago

This query seems to return the ones with the incorrect answers: https://simon.datasette.site/data/mmlu_test_high_school_us_history_test?_where=gpt_35_turbo_answer%20not%20like%20Answer%20||%20%27%%27

That's using an extra _where= of gpt_35_turbo_answer not like Answer || '%', which keeps the rows where the model's answer does not start with the correct letter.
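
The same filter can be tried locally with the standard library `sqlite3` module. This is a sketch only: the `results` table name and the four rows are invented for illustration, and only the `Answer` and `gpt_35_turbo_answer` column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table results (Question text, Answer text, gpt_35_turbo_answer text)"
)
# Invented rows: the model sometimes replies with just a letter, sometimes more.
conn.executemany(
    "insert into results values (?, ?, ?)",
    [
        ("q1", "B", "B"),
        ("q2", "C", "C. extend into the neck."),
        ("q3", "D", "A"),
        ("q4", "A", "a) something"),
    ],
)

# Rows where the model's answer does not start with the expected letter.
incorrect = conn.execute(
    "select Question from results "
    "where gpt_35_turbo_answer not like Answer || '%'"
).fetchall()

total = conn.execute("select count(*) from results").fetchone()[0]
accuracy = (total - len(incorrect)) / total
print(incorrect, accuracy)
```

SQLite's LIKE is case-insensitive for ASCII by default, so "a) something" still counts as matching "A". The prefix check is also loose: a reply such as "Both" would match an expected answer of "B", so results from this filter are worth spot-checking.
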