tstescoTT opened 2 months ago
Llama 3.2 1B using logprob eval. Average across all tasks: 0.230928

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.2295|± |0.0035|
| - humanities | 2|none | |acc |↑ |0.2421|± |0.0062|
| - formal_logic | 1|none | 5|acc |↑ |0.2857|± |0.0404|
| - high_school_european_history | 1|none | 5|acc |↑ |0.2182|± |0.0323|
| - high_school_us_history | 1|none | 5|acc |↑ |0.2500|± |0.0304|
| - high_school_world_history | 1|none | 5|acc |↑ |0.2700|± |0.0289|
| - international_law | 1|none | 5|acc |↑ |0.2397|± |0.0390|
| - jurisprudence | 1|none | 5|acc |↑ |0.2593|± |0.0424|
| - logical_fallacies | 1|none | 5|acc |↑ |0.2209|± |0.0326|
| - moral_disputes | 1|none | 5|acc |↑ |0.2486|± |0.0233|
| - moral_scenarios | 1|none | 5|acc |↑ |0.2380|± |0.0142|
| - philosophy | 1|none | 5|acc |↑ |0.1865|± |0.0221|
| - prehistory | 1|none | 5|acc |↑ |0.2160|± |0.0229|
| - professional_law | 1|none | 5|acc |↑ |0.2458|± |0.0110|
| - world_religions | 1|none | 5|acc |↑ |0.3216|± |0.0358|
| - other | 2|none | |acc |↑ |0.2398|± |0.0076|
| - business_ethics | 1|none | 5|acc |↑ |0.3000|± |0.0461|
| - clinical_knowledge | 1|none | 5|acc |↑ |0.2151|± |0.0253|
| - college_medicine | 1|none | 5|acc |↑ |0.2081|± |0.0310|
| - global_facts | 1|none | 5|acc |↑ |0.1800|± |0.0386|
| - human_aging | 1|none | 5|acc |↑ |0.3139|± |0.0311|
| - management | 1|none | 5|acc |↑ |0.1748|± |0.0376|
| - marketing | 1|none | 5|acc |↑ |0.2906|± |0.0297|
| - medical_genetics | 1|none | 5|acc |↑ |0.3000|± |0.0461|
| - miscellaneous | 1|none | 5|acc |↑ |0.2375|± |0.0152|
| - nutrition | 1|none | 5|acc |↑ |0.2255|± |0.0239|
| - professional_accounting | 1|none | 5|acc |↑ |0.2340|± |0.0253|
| - professional_medicine | 1|none | 5|acc |↑ |0.1838|± |0.0235|
| - virology | 1|none | 5|acc |↑ |0.2831|± |0.0351|
| - social sciences | 2|none | |acc |↑ |0.2171|± |0.0074|
| - econometrics | 1|none | 5|acc |↑ |0.2368|± |0.0400|
| - high_school_geography | 1|none | 5|acc |↑ |0.1768|± |0.0272|
| - high_school_government_and_politics| 1|none | 5|acc |↑ |0.1969|± |0.0287|
| - high_school_macroeconomics | 1|none | 5|acc |↑ |0.2026|± |0.0204|
| - high_school_microeconomics | 1|none | 5|acc |↑ |0.2101|± |0.0265|
| - high_school_psychology | 1|none | 5|acc |↑ |0.1927|± |0.0169|
| - human_sexuality | 1|none | 5|acc |↑ |0.2595|± |0.0384|
| - professional_psychology | 1|none | 5|acc |↑ |0.2500|± |0.0175|
| - public_relations | 1|none | 5|acc |↑ |0.2182|± |0.0396|
| - security_studies | 1|none | 5|acc |↑ |0.1878|± |0.0250|
| - sociology | 1|none | 5|acc |↑ |0.2438|± |0.0304|
| - us_foreign_policy | 1|none | 5|acc |↑ |0.2800|± |0.0451|
| - stem | 2|none | |acc |↑ |0.2125|± |0.0073|
| - abstract_algebra | 1|none | 5|acc |↑ |0.2200|± |0.0416|
| - anatomy | 1|none | 5|acc |↑ |0.1852|± |0.0336|
| - astronomy | 1|none | 5|acc |↑ |0.1776|± |0.0311|
| - college_biology | 1|none | 5|acc |↑ |0.2569|± |0.0365|
| - college_chemistry | 1|none | 5|acc |↑ |0.2000|± |0.0402|
| - college_computer_science | 1|none | 5|acc |↑ |0.2600|± |0.0441|
| - college_mathematics | 1|none | 5|acc |↑ |0.2100|± |0.0409|
| - college_physics | 1|none | 5|acc |↑ |0.2157|± |0.0409|
| - computer_security | 1|none | 5|acc |↑ |0.2800|± |0.0451|
| - conceptual_physics | 1|none | 5|acc |↑ |0.2638|± |0.0288|
| - electrical_engineering | 1|none | 5|acc |↑ |0.2414|± |0.0357|
| - elementary_mathematics | 1|none | 5|acc |↑ |0.2090|± |0.0209|
| - high_school_biology | 1|none | 5|acc |↑ |0.1774|± |0.0217|
| - high_school_chemistry | 1|none | 5|acc |↑ |0.1527|± |0.0253|
| - high_school_computer_science | 1|none | 5|acc |↑ |0.2500|± |0.0435|
| - high_school_mathematics | 1|none | 5|acc |↑ |0.2111|± |0.0249|
| - high_school_physics | 1|none | 5|acc |↑ |0.1987|± |0.0326|
| - high_school_statistics | 1|none | 5|acc |↑ |0.1528|± |0.0245|
| - machine_learning | 1|none | 5|acc |↑ |0.3125|± |0.0440|
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.2295|± |0.0035|
| - humanities | 2|none | |acc |↑ |0.2421|± |0.0062|
| - other | 2|none | |acc |↑ |0.2398|± |0.0076|
| - social sciences| 2|none | |acc |↑ |0.2171|± |0.0074|
| - stem | 2|none | |acc |↑ |0.2125|± |0.0073|
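For reference, a minimal sketch of how a logprob-based MMLU run like the one above can be launched through the lm-eval Python API against an OpenAI-compatible (e.g. vLLM) server. The model name, endpoint URL, and `model_args` shown here are placeholder assumptions, not the exact configuration used for the numbers above; depending on the server, extra `model_args` (e.g. a tokenizer) may also be needed.

```python
# Hypothetical sketch: logprob-based MMLU via the lm-eval Python API against
# an OpenAI-compatible completions endpoint (e.g. a vLLM server).
# Model name, base_url, and concurrency below are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",  # backend that hits /v1/completions and reads returned logprobs
    model_args=(
        "model=meta-llama/Llama-3.2-1B-Instruct,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=8,max_retries=3"
    ),
    tasks=["mmlu"],   # multiple-choice MMLU, scored from per-choice log likelihoods
    num_fewshot=5,
)
print(results["results"]["mmlu"])  # acc should land near the value reported above
```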
Daily:
Using the generative version of the eval. Average across all tasks: 0.273663
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------------------------------------|------:|------------|-----:|-----------|---|-----:|---|-----:|
|mmlu_llama | 1|strict_match| |exact_match|↑ |0.2524|± |0.0034|
| - humanities | 1|strict_match| |exact_match|↑ |0.1095|± |0.0041|
| - formal logic | 1|strict_match| 5|exact_match|↑ |0.3175|± |0.0416|
| - high school european history | 1|strict_match| 5|exact_match|↑ |0.0121|± |0.0085|
| - high school us history | 1|strict_match| 5|exact_match|↑ |0.0098|± |0.0069|
| - high school world history | 1|strict_match| 5|exact_match|↑ |0.0084|± |0.0060|
| - international law | 1|strict_match| 5|exact_match|↑ |0.2479|± |0.0394|
| - jurisprudence | 1|strict_match| 5|exact_match|↑ |0.0833|± |0.0267|
| - logical fallacies | 1|strict_match| 5|exact_match|↑ |0.4663|± |0.0392|
| - moral disputes | 1|strict_match| 5|exact_match|↑ |0.3382|± |0.0255|
| - moral scenarios | 1|strict_match| 5|exact_match|↑ |0.0268|± |0.0054|
| - philosophy | 1|strict_match| 5|exact_match|↑ |0.3312|± |0.0267|
| - prehistory | 1|strict_match| 5|exact_match|↑ |0.1883|± |0.0218|
| - professional law | 1|strict_match| 5|exact_match|↑ |0.0059|± |0.0020|
| - world religions | 1|strict_match| 5|exact_match|↑ |0.2339|± |0.0325|
| - other | 1|strict_match| |exact_match|↑ |0.3566|± |0.0083|
| - business ethics | 1|strict_match| 5|exact_match|↑ |0.2700|± |0.0446|
| - clinical knowledge | 1|strict_match| 5|exact_match|↑ |0.3962|± |0.0301|
| - college medicine | 1|strict_match| 5|exact_match|↑ |0.4046|± |0.0374|
| - global facts | 1|strict_match| 5|exact_match|↑ |0.0500|± |0.0219|
| - human aging | 1|strict_match| 5|exact_match|↑ |0.2511|± |0.0291|
| - management | 1|strict_match| 5|exact_match|↑ |0.3883|± |0.0483|
| - marketing | 1|strict_match| 5|exact_match|↑ |0.2222|± |0.0272|
| - medical genetics | 1|strict_match| 5|exact_match|↑ |0.3900|± |0.0490|
| - miscellaneous | 1|strict_match| 5|exact_match|↑ |0.5441|± |0.0178|
| - nutrition | 1|strict_match| 5|exact_match|↑ |0.3333|± |0.0270|
| - professional accounting | 1|strict_match| 5|exact_match|↑ |0.2695|± |0.0265|
| - professional medicine | 1|strict_match| 5|exact_match|↑ |0.2022|± |0.0244|
| - virology | 1|strict_match| 5|exact_match|↑ |0.3313|± |0.0366|
| - social sciences | 1|strict_match| |exact_match|↑ |0.3416|± |0.0083|
| - econometrics | 1|strict_match| 5|exact_match|↑ |0.2895|± |0.0427|
| - high school geography | 1|strict_match| 5|exact_match|↑ |0.1465|± |0.0252|
| - high school government and politics| 1|strict_match| 5|exact_match|↑ |0.4767|± |0.0360|
| - high school macroeconomics | 1|strict_match| 5|exact_match|↑ |0.1872|± |0.0198|
| - high school microeconomics | 1|strict_match| 5|exact_match|↑ |0.2941|± |0.0296|
| - high school psychology | 1|strict_match| 5|exact_match|↑ |0.4569|± |0.0214|
| - human sexuality | 1|strict_match| 5|exact_match|↑ |0.2290|± |0.0369|
| - professional psychology | 1|strict_match| 5|exact_match|↑ |0.3856|± |0.0197|
| - public relations | 1|strict_match| 5|exact_match|↑ |0.1636|± |0.0354|
| - security studies | 1|strict_match| 5|exact_match|↑ |0.2449|± |0.0275|
| - sociology | 1|strict_match| 5|exact_match|↑ |0.5323|± |0.0353|
| - us foreign policy | 1|strict_match| 5|exact_match|↑ |0.5400|± |0.0501|
| - stem | 0|strict_match| |exact_match|↑ |0.2759|± |0.0075|
| - abstract algebra | 1|strict_match| 5|exact_match|↑ |0.0000|± |0.0000|
| - anatomy | 1|strict_match| 5|exact_match|↑ |0.4741|± |0.0431|
| - astronomy | 1|strict_match| 5|exact_match|↑ |0.4276|± |0.0403|
| - college biology | 1|strict_match| 5|exact_match|↑ |0.4514|± |0.0416|
| - college chemistry | 1|strict_match| 5|exact_match|↑ |0.2400|± |0.0429|
| - college computer science | 1|strict_match| 5|exact_match|↑ |0.3400|± |0.0476|
| - college mathematics | 1|strict_match| 5|exact_match|↑ |0.2200|± |0.0416|
| - college physics | 1|strict_match| 5|exact_match|↑ |0.1275|± |0.0332|
| - computer security | 1|strict_match| 5|exact_match|↑ |0.5200|± |0.0502|
| - conceptual physics | 1|strict_match| 5|exact_match|↑ |0.0894|± |0.0186|
| - electrical engineering | 1|strict_match| 5|exact_match|↑ |0.4552|± |0.0415|
| - elementary mathematics | 1|strict_match| 5|exact_match|↑ |0.2725|± |0.0229|
| - high school biology | 1|strict_match| 5|exact_match|↑ |0.4484|± |0.0283|
| - high school chemistry | 1|strict_match| 5|exact_match|↑ |0.3399|± |0.0333|
| - high school computer science | 1|strict_match| 5|exact_match|↑ |0.3600|± |0.0482|
| - high school mathematics | 1|strict_match| 5|exact_match|↑ |0.0370|± |0.0115|
| - high school physics | 1|strict_match| 5|exact_match|↑ |0.1921|± |0.0322|
| - high school statistics | 1|strict_match| 5|exact_match|↑ |0.1620|± |0.0251|
| - machine learning | 1|strict_match| 5|exact_match|↑ |0.2054|± |0.0383|
| Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------|------:|------------|------|-----------|---|-----:|---|-----:|
|mmlu_llama | 1|strict_match| |exact_match|↑ |0.2524|± |0.0034|
| - humanities | 1|strict_match| |exact_match|↑ |0.1095|± |0.0041|
| - other | 1|strict_match| |exact_match|↑ |0.3566|± |0.0083|
| - social sciences| 1|strict_match| |exact_match|↑ |0.3416|± |0.0083|
| - stem | 0|strict_match| |exact_match|↑ |0.2759|± |0.0075|
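As above, a rough sketch of how the generative `mmlu_llama` variant can be invoked; it scores `exact_match` on generated answers (the `strict_match` filter) rather than comparing logprobs. The endpoint, model string, and chat-template flag are assumptions and may differ from the actual TT setup.

```python
# Hypothetical sketch: generative MMLU (mmlu_llama) via the lm-eval Python API
# against an OpenAI-compatible chat endpoint. Scored by exact_match with the
# strict_match filter, so no logprobs are required from the server.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-chat-completions",  # hits /v1/chat/completions
    model_args=(
        "model=meta-llama/Llama-3.2-1B-Instruct,"
        "base_url=http://localhost:8000/v1/chat/completions"
    ),
    tasks=["mmlu_llama"],
    num_fewshot=5,
    apply_chat_template=True,  # assumed: format few-shot examples with the model's chat template
)
print(results["results"]["mmlu_llama"])  # exact_match under the strict_match filter
```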
Daily:
GPU scores match the log-likelihood variant of the eval. Average across all tasks: 0.230928
Daily:
Instructions to run evals with logprobs:
Need to update this PR: https://github.com/tenstorrent/vllm/pull/81#issuecomment-2942560717
Closing that PR until it can be updated.
Many evals do not generate many tokens (i.e. they do not use `generate_until` in lm-eval). Instead, e.g., only a single multiple-choice answer token is generated, and the logprob response is used, taking the maximum value among the given multiple-choice letters.
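As a rough illustration (not lm-eval code; the helper and its inputs are made up for this sketch), the scoring step for one such multiple-choice question looks like:

```python
# Illustrative only: score one multiple-choice question from per-letter
# logprobs returned by the backend, picking the letter with the highest value.
from typing import Dict

def score_question(letter_logprobs: Dict[str, float], gold: str) -> bool:
    """letter_logprobs maps each choice letter ('A'..'D') to the model's
    log probability of that letter as the continuation of the prompt."""
    predicted = max(letter_logprobs, key=letter_logprobs.get)
    return predicted == gold

# Example logprobs for one question; 'B' has the highest logprob, so it is the prediction.
print(score_question({"A": -2.31, "B": -0.87, "C": -3.05, "D": -1.92}, gold="B"))  # True
```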
Evals:
This is currently blocked on support for logprob responses in the TT vLLM integration: https://github.com/tenstorrent/vllm/issues/37