Open simonw opened 2 months ago
The content of that TAR:
data
data/possibly_contaminated_urls.txt
data/test
data/test/moral_scenarios_test.csv
data/test/us_foreign_policy_test.csv
data/test/public_relations_test.csv
data/test/global_facts_test.csv
data/test/electrical_engineering_test.csv
data/test/astronomy_test.csv
data/test/business_ethics_test.csv
data/test/jurisprudence_test.csv
data/test/high_school_chemistry_test.csv
data/test/college_physics_test.csv
data/test/professional_psychology_test.csv
data/test/marketing_test.csv
data/test/management_test.csv
data/test/virology_test.csv
data/test/international_law_test.csv
data/test/high_school_macroeconomics_test.csv
data/test/prehistory_test.csv
data/test/abstract_algebra_test.csv
data/test/high_school_physics_test.csv
data/test/formal_logic_test.csv
data/test/college_medicine_test.csv
data/test/high_school_us_history_test.csv
data/test/moral_disputes_test.csv
data/test/high_school_european_history_test.csv
data/test/clinical_knowledge_test.csv
data/test/world_religions_test.csv
data/test/high_school_microeconomics_test.csv
data/test/professional_law_test.csv
data/test/human_aging_test.csv
data/test/medical_genetics_test.csv
data/test/high_school_geography_test.csv
data/test/high_school_government_and_politics_test.csv
data/test/anatomy_test.csv
data/test/sociology_test.csv
data/test/logical_fallacies_test.csv
data/test/high_school_computer_science_test.csv
data/test/miscellaneous_test.csv
data/test/high_school_world_history_test.csv
data/test/professional_medicine_test.csv
data/test/high_school_biology_test.csv
data/test/high_school_statistics_test.csv
data/test/college_chemistry_test.csv
data/test/nutrition_test.csv
data/test/econometrics_test.csv
data/test/human_sexuality_test.csv
data/test/security_studies_test.csv
data/test/philosophy_test.csv
data/test/elementary_mathematics_test.csv
data/test/college_biology_test.csv
data/test/college_computer_science_test.csv
data/test/machine_learning_test.csv
data/test/professional_accounting_test.csv
data/test/college_mathematics_test.csv
data/test/high_school_mathematics_test.csv
data/test/high_school_psychology_test.csv
data/test/conceptual_physics_test.csv
data/test/computer_security_test.csv
data/auxiliary_train
data/auxiliary_train/obqa.csv
data/auxiliary_train/science_elementary.csv
data/auxiliary_train/arc_easy.csv
data/auxiliary_train/aux_law_90s.csv
data/auxiliary_train/mc_test.csv
data/auxiliary_train/race.csv
data/auxiliary_train/science_middle.csv
data/auxiliary_train/arc_hard.csv
data/dev
data/dev/prehistory_dev.csv
data/dev/formal_logic_dev.csv
data/dev/conceptual_physics_dev.csv
data/dev/moral_scenarios_dev.csv
data/dev/high_school_macroeconomics_dev.csv
data/dev/clinical_knowledge_dev.csv
data/dev/electrical_engineering_dev.csv
data/dev/high_school_us_history_dev.csv
data/dev/computer_security_dev.csv
data/dev/international_law_dev.csv
data/dev/logical_fallacies_dev.csv
data/dev/business_ethics_dev.csv
data/dev/high_school_psychology_dev.csv
data/dev/professional_accounting_dev.csv
data/dev/management_dev.csv
data/dev/medical_genetics_dev.csv
data/dev/world_religions_dev.csv
data/dev/high_school_chemistry_dev.csv
data/dev/high_school_government_and_politics_dev.csv
data/dev/high_school_computer_science_dev.csv
data/dev/high_school_microeconomics_dev.csv
data/dev/econometrics_dev.csv
data/dev/high_school_world_history_dev.csv
data/dev/nutrition_dev.csv
data/dev/us_foreign_policy_dev.csv
data/dev/global_facts_dev.csv
data/dev/human_aging_dev.csv
data/dev/anatomy_dev.csv
data/dev/abstract_algebra_dev.csv
data/dev/astronomy_dev.csv
data/dev/public_relations_dev.csv
data/dev/human_sexuality_dev.csv
data/dev/high_school_biology_dev.csv
data/dev/college_computer_science_dev.csv
data/dev/high_school_physics_dev.csv
data/dev/college_mathematics_dev.csv
data/dev/high_school_mathematics_dev.csv
data/dev/professional_law_dev.csv
data/dev/high_school_statistics_dev.csv
data/dev/miscellaneous_dev.csv
data/dev/college_medicine_dev.csv
data/dev/professional_psychology_dev.csv
data/dev/college_biology_dev.csv
data/dev/college_physics_dev.csv
data/dev/elementary_mathematics_dev.csv
data/dev/moral_disputes_dev.csv
data/dev/philosophy_dev.csv
data/dev/high_school_geography_dev.csv
data/dev/marketing_dev.csv
data/dev/virology_dev.csv
data/dev/jurisprudence_dev.csv
data/dev/sociology_dev.csv
data/dev/college_chemistry_dev.csv
data/dev/professional_medicine_dev.csv
data/dev/high_school_european_history_dev.csv
data/dev/security_studies_dev.csv
data/dev/machine_learning_dev.csv
data/README.txt
data/val
data/val/security_studies_val.csv
data/val/machine_learning_val.csv
data/val/college_chemistry_val.csv
data/val/professional_medicine_val.csv
data/val/high_school_european_history_val.csv
data/val/jurisprudence_val.csv
data/val/virology_val.csv
data/val/sociology_val.csv
data/val/college_physics_val.csv
data/val/college_biology_val.csv
data/val/philosophy_val.csv
data/val/high_school_geography_val.csv
data/val/moral_disputes_val.csv
data/val/elementary_mathematics_val.csv
data/val/marketing_val.csv
data/val/college_medicine_val.csv
data/val/professional_psychology_val.csv
data/val/professional_law_val.csv
data/val/high_school_statistics_val.csv
data/val/miscellaneous_val.csv
data/val/high_school_mathematics_val.csv
data/val/human_sexuality_val.csv
data/val/high_school_physics_val.csv
data/val/college_computer_science_val.csv
data/val/high_school_biology_val.csv
data/val/college_mathematics_val.csv
data/val/public_relations_val.csv
data/val/anatomy_val.csv
data/val/global_facts_val.csv
data/val/human_aging_val.csv
data/val/astronomy_val.csv
data/val/abstract_algebra_val.csv
data/val/high_school_microeconomics_val.csv
data/val/high_school_government_and_politics_val.csv
data/val/high_school_chemistry_val.csv
data/val/high_school_computer_science_val.csv
data/val/high_school_world_history_val.csv
data/val/econometrics_val.csv
data/val/nutrition_val.csv
data/val/us_foreign_policy_val.csv
data/val/medical_genetics_val.csv
data/val/world_religions_val.csv
data/val/computer_security_val.csv
data/val/international_law_val.csv
data/val/business_ethics_val.csv
data/val/logical_fallacies_val.csv
data/val/professional_accounting_val.csv
data/val/management_val.csv
data/val/high_school_psychology_val.csv
data/val/conceptual_physics_val.csv
data/val/moral_scenarios_val.csv
data/val/electrical_engineering_val.csv
data/val/clinical_knowledge_val.csv
data/val/high_school_macroeconomics_val.csv
data/val/high_school_us_history_val.csv
data/val/prehistory_val.csv
data/val/formal_logic_val.csv
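A listing like the one above can be reproduced without unpacking the archive. This is a minimal sketch using Python's standard `tarfile` module; the helper names `list_tar_members` and `count_by_split` are made up for illustration, and the local filename `data.tar` in the usage note is an assumption.

```python
import tarfile


def list_tar_members(tar_path):
    """Return the member names of a TAR archive, in archive order."""
    with tarfile.open(tar_path) as archive:
        return archive.getnames()


def count_by_split(names):
    """Count CSV files per second-level folder, e.g. data/test or data/dev."""
    counts = {}
    for name in names:
        parts = name.split("/")
        # Only count files shaped like data/<split>/<something>.csv
        if len(parts) == 3 and name.endswith(".csv"):
            counts[parts[1]] = counts.get(parts[1], 0) + 1
    return counts
```

Calling `count_by_split(list_tar_members("data.tar"))` on the listing above should give 57 files each for test, dev and val, plus 8 for auxiliary_train.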
The data/README.txt file says:
This file contains the dev, val, and test data for our multitask test.
The dev dataset is for few-shot learning to prime the model, and the test set the source of evaluation questions.
The auxiliary_training data could be used for fine-tuning, something important for models without few-shot capabilities. This auxiliary training data comes from other NLP multiple choice datasets such as MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), ARC (Clark et al., 2018, 2016), and OBQA (Mihaylov et al., 2018).
Unless otherwise specified, the questions are in reference to human knowledge as of January 1st, 2020. In the far future, it may be useful to add to the prompt that the question is written for 2020 audiences.
--
If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
Here's the contents of the data/dev folder - a set of small (~5 lines each) CSV files extracted into a Gist: https://gist.github.com/simonw/5a275448270a36ba80e35972c3cd9c7e
The CSV files don't have headers, which is annoying. Each one looks like this:
What is the embryological origin of the hyoid bone?,The first pharyngeal arch,The first and second pharyngeal arches,The second pharyngeal arch,The second and third pharyngeal arches,D
Which of these branches of the trigeminal nerve contain somatic motor processes?,The supraorbital nerve,The infraorbital nerve,The mental nerve,None of the above,D
The pleura,have no sensory innervation.,are separated by a 2 mm space.,extend into the neck.,are composed of respiratory epithelium.,C
In Angle's Class II Div 2 occlusion there is,excess overbite of the upper lateral incisors.,negative overjet of the upper central incisors.,excess overjet of the upper lateral incisors.,excess overjet of the upper central incisors.,C
Which of the following is the body cavity that contains the pituitary gland?,Abdominal,Cranial,Pleural,Spinal,B
Here's that transformed into JSON with column names:
[
{
"Question": "What is the embryological origin of the hyoid bone?",
"A": "The first pharyngeal arch",
"B": "The first and second pharyngeal arches",
"C": "The second pharyngeal arch",
"D": "The second and third pharyngeal arches",
"Answer": "D"
},
{
"Question": "Which of these branches of the trigeminal nerve contain somatic motor processes?",
"A": "The supraorbital nerve",
"B": "The infraorbital nerve",
"C": "The mental nerve",
"D": "None of the above",
"Answer": "D"
},
{
"Question": "The pleura",
"A": "have no sensory innervation.",
"B": "are separated by a 2 mm space.",
"C": "extend into the neck.",
"D": "are composed of respiratory epithelium.",
"Answer": "C"
},
{
"Question": "In Angle's Class II Div 2 occlusion there is",
"A": "excess overbite of the upper lateral incisors.",
"B": "negative overjet of the upper central incisors.",
"C": "excess overjet of the upper lateral incisors.",
"D": "excess overjet of the upper central incisors.",
"Answer": "C"
},
{
"Question": "Which of the following is the body cavity that contains the pituitary gland?",
"A": "Abdominal",
"B": "Cranial",
"C": "Pleural",
"D": "Spinal",
"Answer": "B"
}
]
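One way to do that conversion is sketched below with Python's standard `csv` and `json` modules. The column names are the ones used in the JSON above; the function name `rows_to_json` is hypothetical. Passing `fieldnames` explicitly tells `DictReader` to treat the first line as data rather than as a header.

```python
import csv
import io
import json

# Column names taken from the JSON above; the CSV files carry no header row.
COLUMNS = ["Question", "A", "B", "C", "D", "Answer"]


def rows_to_json(csv_text):
    """Convert headerless MMLU-style CSV text into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    return json.dumps(list(reader), indent=2)
```

For a file on disk, something like `rows_to_json(open("data/dev/anatomy_dev.csv", newline="").read())` should work, assuming the archive was extracted into the current directory.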
Loaded that into Datasette: https://simon.datasette.cloud/data/mmlu_dev_anatomy
And ran this enrichment as an experiment:
It got 4/5 right:
I'm going to run the full US high school history test through GPT-3.5: https://simon.datasette.cloud/data/mmlu_test_high_school_us_history_test
That's from data/test/high_school_us_history_test.csv which has 204 rows (sqlite-utils reports 203 because the file has no header row, so the first question gets consumed as the column names):
sqlite-utils memory test/high_school_us_history_test.csv:csv 'select count(*) from t'
[{"count(*)": 203}]
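The off-by-one is what you would expect from any CSV reader that assumes the first line is a header. A small sketch with Python's standard `csv` module illustrates it; the function name `count_rows` is made up for this example, and it is not how sqlite-utils itself is implemented.

```python
import csv
import io


def count_rows(csv_text, first_row_is_header):
    """Count data rows, optionally treating the first row as a header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if first_row_is_header and rows:
        # The first row is consumed as column names, so it is not counted.
        return len(rows) - 1
    return len(rows)
```

Using `csv.reader` rather than counting newlines matters here, since a quoted field can legitimately contain line breaks.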
This query seems to return the ones with the incorrect answers: https://simon.datasette.site/data/mmlu_test_high_school_us_history_test?_where=gpt_35_turbo_answer%20not%20like%20Answer%20||%20%27%%27
That's using an extra _where= of gpt_35_turbo_answer not like Answer || '%'
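To see what that clause does, here is a rough sketch against an in-memory SQLite database using Python's standard `sqlite3` module. The table name `t` and the helper `incorrect_rows` are assumptions for illustration; the column names match the ones in the query above. In SQLite `||` binds tighter than `LIKE`, so the pattern is the correct letter followed by `%`, and `LIKE` is case-insensitive for ASCII by default.

```python
import sqlite3


def incorrect_rows(rows):
    """Return questions whose model answer does not start with the correct letter.

    rows is an iterable of (Question, Answer, gpt_35_turbo_answer) tuples.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "create table t (Question text, Answer text, gpt_35_turbo_answer text)"
    )
    db.executemany("insert into t values (?, ?, ?)", rows)
    cursor = db.execute(
        "select Question from t "
        "where gpt_35_turbo_answer not like Answer || '%'"
    )
    return [row[0] for row in cursor.fetchall()]
```

This is only a prefix check: a reply such as "Depends on..." would count as correct when the expected answer is D, and rows where the model answer is NULL are not returned at all, so the results are worth spot-checking.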
Running a subset of MMLU would be a great proof of concept for this tool.
https://github.com/hendrycks/test - but you have to download a 158M TAR from https://people.eecs.berkeley.edu/~hendrycks/data.tar