tenstorrent / tt-inference-server


Add support for Log Prob evals #160

Open tstescoTT opened 2 months ago

tstescoTT commented 2 months ago

Many evals do not generate many tokens (via generate_until in lm-eval). Instead, for example, only 1 multiple-choice answer token is generated, and the answer is selected from the log probs response by taking the maximum value among the given multiple-choice letters (sketched below).

Evals:

This is currently blocked by support for log prob responses in TT vLLM integration: https://github.com/tenstorrent/vllm/issues/37
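A minimal sketch (not the repo's implementation) of that selection logic, assuming an OpenAI-compatible `/v1/completions` endpoint that returns top logprobs for the single generated token; the URL, model name, and helper are illustrative, and lm-eval implements this internally via its loglikelihood request types:

```python
# Minimal sketch of logprob-based multiple-choice selection (illustrative only).
# Assumes an OpenAI-compatible /v1/completions endpoint that returns top
# logprobs for the one generated token; URL and model name are placeholders.
import requests

API_URL = "http://localhost:8000/v1/completions"  # hypothetical local endpoint
CHOICES = ["A", "B", "C", "D"]


def pick_answer(prompt: str, model: str = "meta-llama/Llama-3.2-1B") -> str:
    """Generate one token and pick the choice letter with the highest logprob."""
    resp = requests.post(API_URL, json={
        "model": model,
        "prompt": prompt,
        "max_tokens": 1,     # only one answer token is generated
        "logprobs": 20,      # request the top-20 logprobs for that token
        "temperature": 0.0,
    }, timeout=60)
    resp.raise_for_status()
    # top_logprobs is a list (one entry per generated token) of dicts mapping
    # candidate token strings to log probabilities
    top_logprobs = resp.json()["choices"][0]["logprobs"]["top_logprobs"][0]

    # Take the maximum over the multiple-choice letters; depending on the
    # tokenizer, candidates may appear with a leading space (" A" vs "A")
    def score(letter: str) -> float:
        return max(top_logprobs.get(letter, float("-inf")),
                   top_logprobs.get(" " + letter, float("-inf")))

    return max(CHOICES, key=score)
```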

mvanniasingheTT commented 2 months ago

Llama 3.2 1B using the logprob eval. Average across all tasks: 0.230928


|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |↑  |0.2295|±  |0.0035|
| - humanities                          |      2|none  |      |acc   |↑  |0.2421|±  |0.0062|
|  - formal_logic                       |      1|none  |     5|acc   |↑  |0.2857|±  |0.0404|
|  - high_school_european_history       |      1|none  |     5|acc   |↑  |0.2182|±  |0.0323|
|  - high_school_us_history             |      1|none  |     5|acc   |↑  |0.2500|±  |0.0304|
|  - high_school_world_history          |      1|none  |     5|acc   |↑  |0.2700|±  |0.0289|
|  - international_law                  |      1|none  |     5|acc   |↑  |0.2397|±  |0.0390|
|  - jurisprudence                      |      1|none  |     5|acc   |↑  |0.2593|±  |0.0424|
|  - logical_fallacies                  |      1|none  |     5|acc   |↑  |0.2209|±  |0.0326|
|  - moral_disputes                     |      1|none  |     5|acc   |↑  |0.2486|±  |0.0233|
|  - moral_scenarios                    |      1|none  |     5|acc   |↑  |0.2380|±  |0.0142|
|  - philosophy                         |      1|none  |     5|acc   |↑  |0.1865|±  |0.0221|
|  - prehistory                         |      1|none  |     5|acc   |↑  |0.2160|±  |0.0229|
|  - professional_law                   |      1|none  |     5|acc   |↑  |0.2458|±  |0.0110|
|  - world_religions                    |      1|none  |     5|acc   |↑  |0.3216|±  |0.0358|
| - other                               |      2|none  |      |acc   |↑  |0.2398|±  |0.0076|
|  - business_ethics                    |      1|none  |     5|acc   |↑  |0.3000|±  |0.0461|
|  - clinical_knowledge                 |      1|none  |     5|acc   |↑  |0.2151|±  |0.0253|
|  - college_medicine                   |      1|none  |     5|acc   |↑  |0.2081|±  |0.0310|
|  - global_facts                       |      1|none  |     5|acc   |↑  |0.1800|±  |0.0386|
|  - human_aging                        |      1|none  |     5|acc   |↑  |0.3139|±  |0.0311|
|  - management                         |      1|none  |     5|acc   |↑  |0.1748|±  |0.0376|
|  - marketing                          |      1|none  |     5|acc   |↑  |0.2906|±  |0.0297|
|  - medical_genetics                   |      1|none  |     5|acc   |↑  |0.3000|±  |0.0461|
|  - miscellaneous                      |      1|none  |     5|acc   |↑  |0.2375|±  |0.0152|
|  - nutrition                          |      1|none  |     5|acc   |↑  |0.2255|±  |0.0239|
|  - professional_accounting            |      1|none  |     5|acc   |↑  |0.2340|±  |0.0253|
|  - professional_medicine              |      1|none  |     5|acc   |↑  |0.1838|±  |0.0235|
|  - virology                           |      1|none  |     5|acc   |↑  |0.2831|±  |0.0351|
| - social sciences                     |      2|none  |      |acc   |↑  |0.2171|±  |0.0074|
|  - econometrics                       |      1|none  |     5|acc   |↑  |0.2368|±  |0.0400|
|  - high_school_geography              |      1|none  |     5|acc   |↑  |0.1768|±  |0.0272|
|  - high_school_government_and_politics|      1|none  |     5|acc   |↑  |0.1969|±  |0.0287|
|  - high_school_macroeconomics         |      1|none  |     5|acc   |↑  |0.2026|±  |0.0204|
|  - high_school_microeconomics         |      1|none  |     5|acc   |↑  |0.2101|±  |0.0265|
|  - high_school_psychology             |      1|none  |     5|acc   |↑  |0.1927|±  |0.0169|
|  - human_sexuality                    |      1|none  |     5|acc   |↑  |0.2595|±  |0.0384|
|  - professional_psychology            |      1|none  |     5|acc   |↑  |0.2500|±  |0.0175|
|  - public_relations                   |      1|none  |     5|acc   |↑  |0.2182|±  |0.0396|
|  - security_studies                   |      1|none  |     5|acc   |↑  |0.1878|±  |0.0250|
|  - sociology                          |      1|none  |     5|acc   |↑  |0.2438|±  |0.0304|
|  - us_foreign_policy                  |      1|none  |     5|acc   |↑  |0.2800|±  |0.0451|
| - stem                                |      2|none  |      |acc   |↑  |0.2125|±  |0.0073|
|  - abstract_algebra                   |      1|none  |     5|acc   |↑  |0.2200|±  |0.0416|
|  - anatomy                            |      1|none  |     5|acc   |↑  |0.1852|±  |0.0336|
|  - astronomy                          |      1|none  |     5|acc   |↑  |0.1776|±  |0.0311|
|  - college_biology                    |      1|none  |     5|acc   |↑  |0.2569|±  |0.0365|
|  - college_chemistry                  |      1|none  |     5|acc   |↑  |0.2000|±  |0.0402|
|  - college_computer_science           |      1|none  |     5|acc   |↑  |0.2600|±  |0.0441|
|  - college_mathematics                |      1|none  |     5|acc   |↑  |0.2100|±  |0.0409|
|  - college_physics                    |      1|none  |     5|acc   |↑  |0.2157|±  |0.0409|
|  - computer_security                  |      1|none  |     5|acc   |↑  |0.2800|±  |0.0451|
|  - conceptual_physics                 |      1|none  |     5|acc   |↑  |0.2638|±  |0.0288|
|  - electrical_engineering             |      1|none  |     5|acc   |↑  |0.2414|±  |0.0357|
|  - elementary_mathematics             |      1|none  |     5|acc   |↑  |0.2090|±  |0.0209|
|  - high_school_biology                |      1|none  |     5|acc   |↑  |0.1774|±  |0.0217|
|  - high_school_chemistry              |      1|none  |     5|acc   |↑  |0.1527|±  |0.0253|
|  - high_school_computer_science       |      1|none  |     5|acc   |↑  |0.2500|±  |0.0435|
|  - high_school_mathematics            |      1|none  |     5|acc   |↑  |0.2111|±  |0.0249|
|  - high_school_physics                |      1|none  |     5|acc   |↑  |0.1987|±  |0.0326|
|  - high_school_statistics             |      1|none  |     5|acc   |↑  |0.1528|±  |0.0245|
|  - machine_learning                   |      1|none  |     5|acc   |↑  |0.3125|±  |0.0440|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.2295|±  |0.0035|
| - humanities     |      2|none  |      |acc   |↑  |0.2421|±  |0.0062|
| - other          |      2|none  |      |acc   |↑  |0.2398|±  |0.0076|
| - social sciences|      2|none  |      |acc   |↑  |0.2171|±  |0.0074|
| - stem           |      2|none  |      |acc   |↑  |0.2125|±  |0.0073|
tstescoTT commented 1 month ago

Daily:

tstescoTT commented 1 month ago

Daily:

mvanniasingheTT commented 1 month ago

Using the generative version of the eval (a rough sketch of this path follows the tables below). Average across all tasks: 0.273663

|                 Tasks                 |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------|------:|------------|-----:|-----------|---|-----:|---|-----:|
|mmlu_llama                             |      1|strict_match|      |exact_match|↑  |0.2524|±  |0.0034|
| - humanities                          |      1|strict_match|      |exact_match|↑  |0.1095|±  |0.0041|
|  - formal logic                       |      1|strict_match|     5|exact_match|↑  |0.3175|±  |0.0416|
|  - high school european history       |      1|strict_match|     5|exact_match|↑  |0.0121|±  |0.0085|
|  - high school us history             |      1|strict_match|     5|exact_match|↑  |0.0098|±  |0.0069|
|  - high school world history          |      1|strict_match|     5|exact_match|↑  |0.0084|±  |0.0060|
|  - international law                  |      1|strict_match|     5|exact_match|↑  |0.2479|±  |0.0394|
|  - jurisprudence                      |      1|strict_match|     5|exact_match|↑  |0.0833|±  |0.0267|
|  - logical fallacies                  |      1|strict_match|     5|exact_match|↑  |0.4663|±  |0.0392|
|  - moral disputes                     |      1|strict_match|     5|exact_match|↑  |0.3382|±  |0.0255|
|  - moral scenarios                    |      1|strict_match|     5|exact_match|↑  |0.0268|±  |0.0054|
|  - philosophy                         |      1|strict_match|     5|exact_match|↑  |0.3312|±  |0.0267|
|  - prehistory                         |      1|strict_match|     5|exact_match|↑  |0.1883|±  |0.0218|
|  - professional law                   |      1|strict_match|     5|exact_match|↑  |0.0059|±  |0.0020|
|  - world religions                    |      1|strict_match|     5|exact_match|↑  |0.2339|±  |0.0325|
| - other                               |      1|strict_match|      |exact_match|↑  |0.3566|±  |0.0083|
|  - business ethics                    |      1|strict_match|     5|exact_match|↑  |0.2700|±  |0.0446|
|  - clinical knowledge                 |      1|strict_match|     5|exact_match|↑  |0.3962|±  |0.0301|
|  - college medicine                   |      1|strict_match|     5|exact_match|↑  |0.4046|±  |0.0374|
|  - global facts                       |      1|strict_match|     5|exact_match|↑  |0.0500|±  |0.0219|
|  - human aging                        |      1|strict_match|     5|exact_match|↑  |0.2511|±  |0.0291|
|  - management                         |      1|strict_match|     5|exact_match|↑  |0.3883|±  |0.0483|
|  - marketing                          |      1|strict_match|     5|exact_match|↑  |0.2222|±  |0.0272|
|  - medical genetics                   |      1|strict_match|     5|exact_match|↑  |0.3900|±  |0.0490|
|  - miscellaneous                      |      1|strict_match|     5|exact_match|↑  |0.5441|±  |0.0178|
|  - nutrition                          |      1|strict_match|     5|exact_match|↑  |0.3333|±  |0.0270|
|  - professional accounting            |      1|strict_match|     5|exact_match|↑  |0.2695|±  |0.0265|
|  - professional medicine              |      1|strict_match|     5|exact_match|↑  |0.2022|±  |0.0244|
|  - virology                           |      1|strict_match|     5|exact_match|↑  |0.3313|±  |0.0366|
| - social sciences                     |      1|strict_match|      |exact_match|↑  |0.3416|±  |0.0083|
|  - econometrics                       |      1|strict_match|     5|exact_match|↑  |0.2895|±  |0.0427|
|  - high school geography              |      1|strict_match|     5|exact_match|↑  |0.1465|±  |0.0252|
|  - high school government and politics|      1|strict_match|     5|exact_match|↑  |0.4767|±  |0.0360|
|  - high school macroeconomics         |      1|strict_match|     5|exact_match|↑  |0.1872|±  |0.0198|
|  - high school microeconomics         |      1|strict_match|     5|exact_match|↑  |0.2941|±  |0.0296|
|  - high school psychology             |      1|strict_match|     5|exact_match|↑  |0.4569|±  |0.0214|
|  - human sexuality                    |      1|strict_match|     5|exact_match|↑  |0.2290|±  |0.0369|
|  - professional psychology            |      1|strict_match|     5|exact_match|↑  |0.3856|±  |0.0197|
|  - public relations                   |      1|strict_match|     5|exact_match|↑  |0.1636|±  |0.0354|
|  - security studies                   |      1|strict_match|     5|exact_match|↑  |0.2449|±  |0.0275|
|  - sociology                          |      1|strict_match|     5|exact_match|↑  |0.5323|±  |0.0353|
|  - us foreign policy                  |      1|strict_match|     5|exact_match|↑  |0.5400|±  |0.0501|
| - stem                                |      0|strict_match|      |exact_match|↑  |0.2759|±  |0.0075|
|  - abstract algebra                   |      1|strict_match|     5|exact_match|↑  |0.0000|±  |0.0000|
|  - anatomy                            |      1|strict_match|     5|exact_match|↑  |0.4741|±  |0.0431|
|  - astronomy                          |      1|strict_match|     5|exact_match|↑  |0.4276|±  |0.0403|
|  - college biology                    |      1|strict_match|     5|exact_match|↑  |0.4514|±  |0.0416|
|  - college chemistry                  |      1|strict_match|     5|exact_match|↑  |0.2400|±  |0.0429|
|  - college computer science           |      1|strict_match|     5|exact_match|↑  |0.3400|±  |0.0476|
|  - college mathematics                |      1|strict_match|     5|exact_match|↑  |0.2200|±  |0.0416|
|  - college physics                    |      1|strict_match|     5|exact_match|↑  |0.1275|±  |0.0332|
|  - computer security                  |      1|strict_match|     5|exact_match|↑  |0.5200|±  |0.0502|
|  - conceptual physics                 |      1|strict_match|     5|exact_match|↑  |0.0894|±  |0.0186|
|  - electrical engineering             |      1|strict_match|     5|exact_match|↑  |0.4552|±  |0.0415|
|  - elementary mathematics             |      1|strict_match|     5|exact_match|↑  |0.2725|±  |0.0229|
|  - high school biology                |      1|strict_match|     5|exact_match|↑  |0.4484|±  |0.0283|
|  - high school chemistry              |      1|strict_match|     5|exact_match|↑  |0.3399|±  |0.0333|
|  - high school computer science       |      1|strict_match|     5|exact_match|↑  |0.3600|±  |0.0482|
|  - high school mathematics            |      1|strict_match|     5|exact_match|↑  |0.0370|±  |0.0115|
|  - high school physics                |      1|strict_match|     5|exact_match|↑  |0.1921|±  |0.0322|
|  - high school statistics             |      1|strict_match|     5|exact_match|↑  |0.1620|±  |0.0251|
|  - machine learning                   |      1|strict_match|     5|exact_match|↑  |0.2054|±  |0.0383|

|      Groups      |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|------------------|------:|------------|------|-----------|---|-----:|---|-----:|
|mmlu_llama        |      1|strict_match|      |exact_match|↑  |0.2524|±  |0.0034|
| - humanities     |      1|strict_match|      |exact_match|↑  |0.1095|±  |0.0041|
| - other          |      1|strict_match|      |exact_match|↑  |0.3566|±  |0.0083|
| - social sciences|      1|strict_match|      |exact_match|↑  |0.3416|±  |0.0083|
| - stem           |      0|strict_match|      |exact_match|↑  |0.2759|±  |0.0075|
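For contrast with the logprob path sketched earlier, a rough illustration of the generative path used here: the model generates a short completion and a strict pattern extracts the answer letter. The endpoint, model name, and regex are placeholders; the exact strict_match filter used by mmlu_llama may differ.

```python
# Rough sketch (illustrative only) of the generative eval path: generate a few
# tokens, then extract the answer letter with a strict pattern.
import re
import requests

API_URL = "http://localhost:8000/v1/completions"  # hypothetical local endpoint


def generative_answer(prompt: str, model: str = "meta-llama/Llama-3.2-1B") -> str:
    """Generate a few tokens and extract the multiple-choice answer letter."""
    resp = requests.post(API_URL, json={
        "model": model,
        "prompt": prompt,
        "max_tokens": 16,     # generate text instead of scoring a single token
        "temperature": 0.0,
    }, timeout=60)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["text"]
    # Strict match: only accept an explicit single answer letter
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else ""
```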
tstescoTT commented 1 month ago

Daily:

mvanniasingheTT commented 1 month ago

GPU scores match the log likelihood variant of the eval. Average across all tasks: 0.230928

tstescoTT commented 1 month ago

Daily:

mvanniasingheTT commented 1 month ago

Instructions to run evals with logprobs:

  1. Build the Docker container using the tt-metal version specified in the tt-inference-server main README.md.
  2. If the vLLM version corresponds to e2e0002ac7dcbc5793983c0f967474d4dcab21f8, use the following commit of vLLM: b7697d5d584d508879bd6699d19f25d4e2f9f800.
  3. Run evals as usual (a quick logprob smoke test is sketched below).
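Before kicking off the eval run in step 3, an illustrative smoke test (not part of the repo) to confirm the server built in steps 1 and 2 actually returns logprobs; the endpoint and model name are placeholders for the local deployment:

```python
# Illustrative smoke test: confirm the local vLLM server returns logprobs
# before running the full evals. Endpoint and model name are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # hypothetical local endpoint
    json={
        "model": "meta-llama/Llama-3.2-1B",  # example model name
        "prompt": "Answer:",
        "max_tokens": 1,
        "logprobs": 5,
    },
    timeout=60,
)
resp.raise_for_status()
logprobs = resp.json()["choices"][0]["logprobs"]
# Without logprob support in the vLLM integration this field is typically null
assert logprobs is not None, "Server did not return logprobs"
print(logprobs["top_logprobs"][0])
```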
tstescoTT commented 1 week ago

Need to update PR: https://github.com/tenstorrent/vllm/pull/81#issuecomment-2942560717

Closing that PR until it can be updated.