mlcommons / mobile_app_open

Mobile App Open
https://mlcommons.org/en/groups/inference-mobile/
Apache License 2.0

"Accuracy" metric for LLM model(s) #986

Open freedomtan opened 2 months ago

freedomtan commented 2 months ago

Which "accuracy" metric(s) should we use for LLM benchmarking?

mohitmundhragithub commented 2 months ago

In the MLPerf Client group, they have used the MMLU score for accuracy computation. They have a separate set of scripts that processes the output logs generated during inference to compute the MMLU score; it's an offline process. You are right... it takes several hours to run the whole suite. In that sense, tinyMMLU makes more sense.

The scripts used in the client group are shared here: https://github.com/mlcommons/mlperf_client_dev/tree/main/tools/mmlu

But, sadly, there is no public repository containing all the MMLU computation scripts.

One big problem I see is that the MMLU score doesn't measure the task used for performance: it measures the accuracy of only the first token, so we are not measuring the whole pipeline. Even in the client group, they are looking for other options.

In the inference group, the metric adopted is ROUGE-N. Maybe we can explore that as well.
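
For reference, ROUGE-N measures n-gram overlap between a generated text and a reference, which is why it pairs naturally with summarization-style outputs. A minimal sketch using the rouge_score package (an illustrative choice; the inference group's reference scripts may use a different implementation):

from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
candidate = "A quick brown fox jumped over a lazy dog."

# rouge1/rouge2 are unigram/bigram overlap; rougeL is the longest-common-subsequence variant.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.3f} recall={score.recall:.3f} f1={score.fmeasure:.3f}")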

freedomtan commented 2 months ago

Let's check if we can run tinyBenchmarks (100 row validation sets maybe) with llama 3.2 1B and 3B and get acceptable numbers first.

@Mostelk @AhmedTElthakeb and @mohitmundhragithub
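
If it helps, the same run can also be driven from Python instead of the shell. A minimal sketch, assuming lm-eval 0.4.x and one of the models above:

import lm_eval

# Evaluate the tinyBenchmarks task group on a Hugging Face model; this mirrors
# `lm_eval --model hf --model_args pretrained=... --tasks tinyBenchmarks`.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-3B",
    tasks=["tinyBenchmarks"],
    batch_size=1,
)
print(results["results"])  # per-task metrics such as acc_norm for tinyMMLU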

mohitmundhragithub commented 2 months ago

Let's check if we can run tinyBenchmarks (100 row validation sets maybe) with llama 3.2 1B and 3B and get acceptable numbers first.

@Mostelk @AhmedTElthakeb and @mohitmundhragithub

I think we discussed llama-3.1-8B and llama-3.2-3B in the meeting... right?

freedomtan commented 2 months ago

Let's check if we can run tinyBenchmarks (100 row validation sets maybe) with llama 3.2 1B and 3B and get acceptable numbers first. @Mostelk @AhmedTElthakeb and @mohitmundhragithub

I think we discussed llama-3.1-8B and llama-3.2-3B in the meeting... right?

right, my bad.

freedomtan commented 2 months ago

baseline numbers

llama 3.2 3B with lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B --tasks tinyBenchmarks:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.5055 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.3170 | ± N/A |
| | | strict-match | 5 | exact_match | 0.3170 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.7742 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.5910 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.4042 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.6875 | ± N/A |

llama 3.2 3B Instruct with lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct --tasks tinyBenchmarks:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.5670 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.6233 | ± N/A |
| | | strict-match | 5 | exact_match | 0.5994 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.7494 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.6244 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.4795 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.6298 | ± N/A |

llama 3.1 8B with lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B --tasks tinyBenchmarks:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.5733 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.5037 | ± N/A |
| | | strict-match | 5 | exact_match | 0.5037 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.8344 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.6335 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.4776 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.7507 | ± N/A |

llama 3.1 8B Instruct with lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks tinyBenchmarks

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.6533 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.7610 | ± N/A |
| | | strict-match | 5 | exact_match | 0.7289 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.8054 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.6413 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.5409 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.7263 | ± N/A |

freedomtan commented 2 months ago

gemma 3 1b

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.4571 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.2647 | ± N/A |
| | | strict-match | 5 | exact_match | 0.2514 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.5312 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.4935 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.3836 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.6210 | ± N/A |

gemma 3 1b qat q4_0 unquantized

(https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-unquantized)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.4125 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.1290 | ± N/A |
| | | strict-match | 5 | exact_match | 0.1290 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.5563 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.4550 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.3881 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.6292 | ± N/A |

gemma 3 4b

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.6002 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.7760 | ± N/A |
| | | strict-match | 5 | exact_match | 0.7760 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.7045 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.6005 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.4619 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.6961 | ± N/A |

gemma 3 4b qat q4_0 unquantized

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.6137 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.7574 | ± N/A |
| | | strict-match | 5 | exact_match | 0.7574 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.6894 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.6036 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.4426 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.7046 | ± N/A |

Mostelk commented 2 months ago

Let us try to quantize these and report accuracy for llama 3.1 8B Instruct and llama 3.2 3B Instruct.

Aswinoss commented 2 months ago

Image

After some exploration, the use cases we are trying to enable (say, summarization, context generation, etc.) are not properly captured by the datasets used in tinyBenchmarks. Most of the datasets have single-token outputs (except for GSM8k and AlpacaEval), and none of them captures the summarization task in particular.

@Mostelk should we still go ahead with the above task of quantization and accuracy collection for these datasets?

We are also looking into the Open-Orca dataset, which captures all the tasks we discussed in the last meeting.

We are also looking into whether the ROUGE-N score could be a viable accuracy metric for the Open-Orca dataset.

freedomtan commented 2 months ago

Image

After some exploration, the use cases we are trying to enable (say, summarization, context generation, etc.) are not properly captured by the datasets used in tinyBenchmarks. Most of the datasets have single-token outputs (except for GSM8k and AlpacaEval), and none of them captures the summarization task in particular.

@Mostelk should we still go ahead with the above task of quantization and accuracy collection for these datasets?

We are also looking into the Open-Orca dataset, which captures all the tasks we discussed in the last meeting.

We are also looking into whether the ROUGE-N score could be a viable accuracy metric for the Open-Orca dataset.

@Aswinoss I am open to other benchmarks. However, I don't agree that the benchmarks used by tinyBenchmarks have single-token outputs; they do not.

Aswinoss commented 1 month ago

@freedomtan From the tinyBenchmarks page, the ones I have marked as Single-Token had descriptions suggesting single-token outputs (please find below). We have not run them to be exactly sure, but it looks like the mentioned benchmarks won't produce a stream of text as output (except for GSM8k and AlpacaEval).

For example, tinyTruthfulQA describes its input/output this way: we ask a question and the LLM responds with a label / label list, where the true option has 1 and the rest are all 0s.

Image

One more thing: we are also okay with any of these accuracy benchmarks, but based on the discussion we had in the last engineering meeting, we thought that these datasets may not be representative enough.

freedomtan commented 1 month ago

@freedomtan From the tinyBenchmarks page, the ones I have marked as Single-Token had descriptions suggesting single-token outputs (please find below). We have not run them to be exactly sure, but it looks like the mentioned benchmarks won't produce a stream of text as output (except for GSM8k and AlpacaEval).

For example, tinyTruthfulQA describes its input/output this way: we ask a question and the LLM responds with a label / label list, where the true option has 1 and the rest are all 0s.

Image

One more thing: we are also okay with any of these accuracy benchmarks, but based on the discussion we had in the last engineering meeting, we thought that these datasets may not be representative enough.

@Aswinoss it's quite easy to check outputs of those benchmarks. For example, we can

The example you show is kind of an internal representation of the dataset.

freedomtan commented 1 month ago

How about the quantized models from Meta folks? We know they are available on Hugging Face too.

Well, they are not in Hugging Face safetensors format; that is, we cannot evaluate them with lm_eval --model hf --model_args pretrained=.... It seems to me that we have to use ExecuTorch (see https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md).

freedomtan commented 1 month ago

@mohitmundhragithub @Aswinoss and @Mostelk OpenOrca is a dataset, not a benchmark.

Mostelk commented 1 month ago

@mohitmundhragithub @freedomtan I agree, let us consider other metrics as well, but I prefer a metric that is publicly published by Meta; for example, I see here that they use SQuAD for reading comprehension. Also, we need to decide whether to use the general model or the instruction-tuned model; it looks like the reading comprehension results below are for the original general (base) model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

Given we used SQuAD before with MobileBERT, I don't think it should be hard here either. Does Meta report numbers for a summarization task?

| Category | Benchmark | # shots | Metric | Values (across model sizes, per Meta's model card) |
|---|---|---|---|---|
| Reading comprehension | SQuAD | 1 | em | 76.4 / 77.0 / 85.6 / 81.8 / 89.3 |
| | QuAC (F1) | 1 | f1 | 44.4 / 44.9 / 51.1 / 51.1 / 53.6 |
| | BoolQ | 0 | acc_char | 75.7 / 75.0 / 79.0 / 79.4 / 80.0 |
| | DROP (F1) | 3 | f1 | 58.4 / 59.5 / 79.7 / 79.6 / 84.8 |
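
For reference, SQuAD-style scoring is lightweight: exact match (EM) compares the normalized prediction to the normalized gold answer, and F1 is token-overlap based. A rough sketch of the standard normalization and metrics (not the official SQuAD/MobileBERT script, which also handles multiple gold answers):

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())                # collapse whitespace

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
print(token_f1("in Paris, France", "Paris"))            # 0.5, partial credit
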
Mostelk commented 1 month ago

@mohitmundhragithub @freedomtan please check these 3.1 8B quantized models, with their evaluation; it looks like they use lm-eval, so maybe we can adopt some tiny lm-eval setup (they appear to use static weight quantization but dynamic activation quantization): https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16

mohitmundhragithub commented 1 month ago

@mohitmundhragithub @Aswinoss and @Mostelk OpenOrca is a dataset, not a benchmark.

Agreed, it's a dataset; the corresponding accuracy metric is ROUGE-N.

Mostelk commented 1 month ago

This paper (https://arxiv.org/pdf/2208.03299) also has an interesting code base that may be easier to integrate than lm-eval or tiny lm-eval; we would just focus on the zero-shot cases for our use case: https://github.com/facebookresearch/atlas?tab=readme-ov-file#tasks , e.g. for MMLU: https://github.com/facebookresearch/atlas/blob/main/example_scripts/mmlu/README_MMLU.md

freedomtan commented 1 month ago

This paper (https://arxiv.org/pdf/2208.03299) also has an interesting code base that may be easier to integrate than lm-eval or tiny lm-eval; we would just focus on the zero-shot cases for our use case: https://github.com/facebookresearch/atlas?tab=readme-ov-file#tasks , e.g. for MMLU: https://github.com/facebookresearch/atlas/blob/main/example_scripts/mmlu/README_MMLU.md

Why do we need that tool? Running either MMLU or tinyMMLU is quite straightforward. Simply feed formatted inputs into the model, then we can obtain reasonable results.
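
To make "feed formatted inputs into the model" concrete: harness-style MMLU scoring formats the question with its four options and picks the answer letter whose continuation gets the highest log-likelihood. A simplified sketch with transformers (the actual lm_eval prompt templates and tokenization-boundary handling differ in detail):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # any causal LM from this thread works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

question = ("The following is a multiple-choice question.\n"
            "Which planet is known as the Red Planet?\n"
            "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\nAnswer:")

def continuation_logprob(context: str, continuation: str) -> float:
    # Sum the log-probabilities of the continuation tokens given the context.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    total = 0.0
    for pos in range(ctx_len, full_ids.shape[1]):
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

scores = {letter: continuation_logprob(question, f" {letter}") for letter in "ABCD"}
print(max(scores, key=scores.get))  # expected: B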

Mostelk commented 1 month ago

This paper (https://arxiv.org/pdf/2208.03299) also has an interesting code base that may be easier to integrate than lm-eval or tiny lm-eval; we would just focus on the zero-shot cases for our use case: https://github.com/facebookresearch/atlas?tab=readme-ov-file#tasks , e.g. for MMLU: https://github.com/facebookresearch/atlas/blob/main/example_scripts/mmlu/README_MMLU.md

Why do we need that tool? Running either MMLU or tinyMMLU is quite straightforward. Simply feed formatted inputs into the model, then we can obtain reasonable results.

Agreed, it looks like it does exactly that.

Mostelk commented 1 month ago

Let us try to quantize these and report accuracy for llama 3.1 8B Instruct and llama 3.2 3B Instruct.

We will use MMLU (5-shot) to report accuracies after quantizing these models. @mohitmundhragithub @AhmedTElthakeb @freedomtan

freedomtan commented 1 month ago

As I said, Meta's quantized llama 3.2 3B models can be evaluated with the ExecuTorch code.

With

export LLAMA_DIR="/Users/freedom/.llama/checkpoints"
export LLAMA_QUANTIZED_CHECKPOINT=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/consolidated.00.pth"
export LLAMA_PARAMS=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/params.json"
export LLAMA_TOKENIZER=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/tokenizer.model"

python -m executorch.examples.models.llama.eval_llama \
  --model "llama3_2" -qat -lora 16 \
  --preq_mode 8da4w_output_8da8w \
  --preq_group_size 32 \
  --preq_embedding_quantize 8,0 \
  --use_sdpa_with_kv_cache -kv \
  -X --xnnpack-extended-ops \
  -d fp32 \
  --max_seq_length 8192 --max_context_length 8192 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  -c "${LLAMA_QUANTIZED_CHECKPOINT:?}" -p "${LLAMA_PARAMS:?}" -t "${LLAMA_TOKENIZER}" \
  --tasks tinyMMLU --num_fewshot 5

Note that tinyMMLU's number of few-shot examples is set to zero; we had to change it to make --num_fewshot 5 work.

With that, I got

INFO:lm_eval.evaluator:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm_eval.evaluator:Using pre-initialized model
WARNING:lm_eval.evaluator:Overwriting default num_fewshot of tinyMMLU from 1 to 5
INFO:lm_eval.api.task:Building contexts for tinyMMLU on rank 0...
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1748.27it/s]
INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests: 100%|█████████| 400/400 [17:28<00:00,  2.62s/it]
tinyMMLU: {'alias': 'tinyMMLU', 'acc_norm,none': np.float64(0.5882945804372754), 'acc_norm_stderr,none': 'N/A'}

on a Mac.

Note that it's also possible to evaluate ExecuTorch .pte model accuracy with python -m executorch.examples.models.llama.eval_llama --pte ....

freedomtan commented 1 month ago

To get baseline numbers for llama 3.2 3B Instruct on MMLU with lm_eval

$ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct --tasks mmlu --num_fewshot 5

I got

hf (pretrained=meta-llama/Llama-3.2-3B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1

Tasks Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.5959 ± 0.0040
- humanities 2 none acc 0.5637 ± 0.0070
- formal_logic 1 none 5 acc 0.3968 ± 0.0438
- high_school_european_history 1 none 5 acc 0.7697 ± 0.0329
- high_school_us_history 1 none 5 acc 0.7451 ± 0.0306
- high_school_world_history 1 none 5 acc 0.7806 ± 0.0269
- international_law 1 none 5 acc 0.6860 ± 0.0424
- jurisprudence 1 none 5 acc 0.6944 ± 0.0445
- logical_fallacies 1 none 5 acc 0.7055 ± 0.0358
- moral_disputes 1 none 5 acc 0.6879 ± 0.0249
- moral_scenarios 1 none 5 acc 0.4425 ± 0.0166
- philosophy 1 none 5 acc 0.6624 ± 0.0269
- prehistory 1 none 5 acc 0.6481 ± 0.0266
- professional_law 1 none 5 acc 0.4524 ± 0.0127
- world_religions 1 none 5 acc 0.7076 ± 0.0349
- other 2 none acc 0.6505 ± 0.0083
- business_ethics 1 none 5 acc 0.5700 ± 0.0498
- clinical_knowledge 1 none 5 acc 0.6604 ± 0.0291
- college_medicine 1 none 5 acc 0.5549 ± 0.0379
- global_facts 1 none 5 acc 0.3200 ± 0.0469
- human_aging 1 none 5 acc 0.6323 ± 0.0324
- management 1 none 5 acc 0.7087 ± 0.0450
- marketing 1 none 5 acc 0.8547 ± 0.0231
- medical_genetics 1 none 5 acc 0.7500 ± 0.0435
- miscellaneous 1 none 5 acc 0.7292 ± 0.0159
- nutrition 1 none 5 acc 0.6732 ± 0.0269
- professional_accounting 1 none 5 acc 0.4645 ± 0.0298
- professional_medicine 1 none 5 acc 0.7059 ± 0.0277
- virology 1 none 5 acc 0.4337 ± 0.0386
- social sciences 2 none acc 0.6796 ± 0.0082
- econometrics 1 none 5 acc 0.4035 ± 0.0462
- high_school_geography 1 none 5 acc 0.7424 ± 0.0312
- high_school_government_and_politics 1 none 5 acc 0.8031 ± 0.0287
- high_school_macroeconomics 1 none 5 acc 0.5513 ± 0.0252
- high_school_microeconomics 1 none 5 acc 0.6471 ± 0.0310
- high_school_psychology 1 none 5 acc 0.7853 ± 0.0176
- human_sexuality 1 none 5 acc 0.7099 ± 0.0398
- professional_psychology 1 none 5 acc 0.6062 ± 0.0198
- public_relations 1 none 5 acc 0.6636 ± 0.0453
- security_studies 1 none 5 acc 0.6939 ± 0.0295
- sociology 1 none 5 acc 0.7910 ± 0.0287
- us_foreign_policy 1 none 5 acc 0.8000 ± 0.0402
- stem 2 none acc 0.5084 ± 0.0086
- abstract_algebra 1 none 5 acc 0.2900 ± 0.0456
- anatomy 1 none 5 acc 0.5778 ± 0.0427
- astronomy 1 none 5 acc 0.6513 ± 0.0388
- college_biology 1 none 5 acc 0.7222 ± 0.0375
- college_chemistry 1 none 5 acc 0.4600 ± 0.0501
- college_computer_science 1 none 5 acc 0.5700 ± 0.0498
- college_mathematics 1 none 5 acc 0.3000 ± 0.0461
- college_physics 1 none 5 acc 0.3824 ± 0.0484
- computer_security 1 none 5 acc 0.6400 ± 0.0482
- conceptual_physics 1 none 5 acc 0.5106 ± 0.0327
- electrical_engineering 1 none 5 acc 0.6000 ± 0.0408
- elementary_mathematics 1 none 5 acc 0.4471 ± 0.0256
- high_school_biology 1 none 5 acc 0.7194 ± 0.0256
- high_school_chemistry 1 none 5 acc 0.5320 ± 0.0351
- high_school_computer_science 1 none 5 acc 0.6000 ± 0.0492
- high_school_mathematics 1 none 5 acc 0.3519 ± 0.0291
- high_school_physics 1 none 5 acc 0.3709 ± 0.0394
- high_school_statistics 1 none 5 acc 0.4537 ± 0.0340
- machine_learning 1 none 5 acc 0.3661 ± 0.0457
Groups Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.5959 ± 0.0040
- humanities 2 none acc 0.5637 ± 0.0070
- other 2 none acc 0.6505 ± 0.0083
- social sciences 2 none acc 0.6796 ± 0.0082
- stem 2 none acc 0.5084 ± 0.0086

with colab + L4 GPU

freedomtan commented 1 month ago

how does ExecuTorch's executorch.examples.models.llama.eval_llama work?

Mainly, it calls lm_eval's evaluator.simple_evaluate(); see https://github.com/pytorch/executorch/blob/main/examples/models/llama/eval_llama_lib.py#L295-L320 and https://github.com/EleutherAI/lm-evaluation-harness/blob/8bc4afff22e73995883de41018388428e39f8a92/lm_eval/evaluator.py#L47
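
So any backend that can be wrapped in an lm_eval model object reuses the same task plumbing. A sketch of the "pre-initialized model" pattern using the stock HF wrapper (eval_llama does the equivalent with its own ExecuTorch-backed wrapper class); assumes lm-eval 0.4.x:

import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"
hf_model = AutoModelForCausalLM.from_pretrained(model_id)
hf_tok = AutoTokenizer.from_pretrained(model_id)

# Pass an already-constructed model object instead of a name string.
wrapped = HFLM(pretrained=hf_model, tokenizer=hf_tok, batch_size=1)
results = lm_eval.simple_evaluate(model=wrapped, tasks=["tinyMMLU"], num_fewshot=5)
print(results["results"]["tinyMMLU"])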

freedomtan commented 1 month ago

evaluated with lm_eval --model hf --model_args pretrained=meta-llama/... --tasks mmlu --num_fewshot 5 on Colab (w/ L4 GPU)

| model | MMLU (5-shot) |
|---|---|
| 3.2 1B Instruct | 0.4557 ± 0.0041 |
| 3.2 3B Instruct | 0.5959 ± 0.0040 |
| 3.1 8B Instruct | 0.6820 ± 0.0037 |

freedomtan commented 1 month ago

MMLU (0-shot, 5-shot):

freedomtan commented 1 month ago

running MMLU with lm_eval

| num_fewshot | context length |
|---|---|
| 0 | < 1024 |
| 1 | could be > 1024 |
| 5 | could be > 3072 |

with something like

python -m executorch.examples.models.llama.eval_llama -c "${LLAMA_CHECKPOINT:?}" -p "${LLAMA_PARAMS:?}" -t "${LLAMA_TOKENIZER}" -kv -d bf16 --tasks mmlu --num_fewshot ${NUM_FEWSHOT}

0-shot:

INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests:   0% 0/56168 [00:00<?, ?it/s]WARNING:lm_eval.models.huggingface:Combined length of context (984) and continuation (1) exceeds model's maximum length (128). Truncating 858 tokens from the left.
Running loglikelihood requests:   0% 1/56168 [00:07<115:14:52,  7.39s/it]WARNING:lm_eval.models.huggingface:Combined length of context (973) and continuation (1) exceeds model's maximum length (128). Truncating 847 tokens from the left.
Running loglikelihood requests:   0% 5/56168 [00:14<41:25:25,  2.66s/it] WARNING:lm_eval.models.huggingface:Combined length of context (968) and continuation (1) exceeds model's maximum length (128). Truncating 842 tokens from the left.
Running loglikelihood requests:   0% 9/56168 [00:21<33:07:32,  2.12s/it]WARNING:lm_eval.models.huggingface:Combined length of context (945) and continuation (1) exceeds model's maximum length (128). Truncating 819 tokens from the left.
Running loglikelihood requests:   0% 13/56168 [00:28<31:07:29,  2.00s/it]WARNING:lm_eval.models.huggingface:Combined length of context (944) and continuation (1) exceeds model's maximum length (128). Truncating 818 tokens from the left.
Running loglikelihood requests:   0% 17/56168 [00:35<29:52:43,  1.92s/it]WARNING:lm_eval.models.huggingface:Combined length of context (758) and continuation (1) exceeds model's maximum length (128). Truncating 632 tokens from the left.
Running loglikelihood requests:   0% 21/56168 [00:42<28:23:05,  1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (743) and continuation (1) exceeds model's maximum length (128). Truncating 617 tokens from the left.
Running loglikelihood requests:   0% 25/56168 [00:50<29:02:31,  1.86s/it]WARNING:lm_eval.models.huggingface:Combined length of context (741) and continuation (1) exceeds model's maximum length (128). Truncating 615 tokens from the left.
Running loglikelihood requests:   0% 29/56168 [00:56<27:53:27,  1.79s/it]WARNING:lm_eval.models.huggingface:Combined length of context (739) and continuation (1) exceeds model's maximum length (128). Truncating 613 tokens from the left.
Running loglikelihood requests:   0% 33/56168 [01:04<28:21:21,  1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (737) and continuation (1) exceeds model's maximum length (128). Truncating 611 tokens from the left.
Running loglikelihood requests:   0% 37/56168 [01:10<27:27:25,  1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (688) and continuation (1) exceeds model's maximum length (128). Truncating 562 tokens from the left.
Running loglikelihood requests:   0% 41/56168 [01:18<27:59:45,  1.80s/it]WARNING:lm_eval.models.huggingface:Combined length of context (685) and continuation (1) exceeds model's maximum length (128). Truncating 559 tokens from the left.
Running loglikelihood requests:   0% 45/56168 [01:24<27:13:56,  1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (673) and continuation (1) exceeds model's maximum length (128). Truncating 547 tokens from the left.
Running loglikelihood requests:   0% 49/56168 [01:32<27:58:14,  1.79s/it]WARNING:lm_eval.models.huggingface:Combined length of context (666) and continuation (1) exceeds model's maximum length (128). Truncating 540 tokens from the left.
Running loglikelihood requests:   0% 53/56168 [01:39<27:15:06,  1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (664) and continuation (1) exceeds model's maximum length (128). Truncating 538 tokens from the left.
Running loglikelihood requests:   0% 57/56168 [01:46<27:34:17,  1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (653) and continuation (1) exceeds model's maximum length (128). Truncating 527 tokens from the left.
Running loglikelihood requests:   0% 61/56168 [01:53<27:07:42,  1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (645) and continuation (1) exceeds model's maximum length (128). Truncating 519 tokens from the left.
Running loglikelihood requests:   0% 65/56168 [02:00<27:21:57,  1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (644) and continuation (1) exceeds model's maximum length (128). Truncating 518 tokens from the left.
Running loglikelihood requests:   0% 69/56168 [02:07<27:09:36,  1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (644) and continuation (1) exceeds model's maximum length (128). Truncating 518 tokens from the left.
....

1-shot:

INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests:   0% 0/56168 [00:00<?, ?it/s]WARNING:lm_eval.models.huggingface:Combined length of context (1108) and continuation (1) exceeds model's maximum length (128). Truncating 982 tokens from the left.
Running loglikelihood requests:   0% 1/56168 [00:08<129:48:35,  8.32s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1089) and continuation (1) exceeds model's maximum length (128). Truncating 963 tokens from the left.
Running loglikelihood requests:   0% 5/56168 [00:14<40:53:06,  2.62s/it] WARNING:lm_eval.models.huggingface:Combined length of context (1076) and continuation (1) exceeds model's maximum length (128). Truncating 950 tokens from the left.
Running loglikelihood requests:   0% 9/56168 [00:22<34:21:21,  2.20s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1060) and continuation (1) exceeds model's maximum length (128). Truncating 934 tokens from the left.
Running loglikelihood requests:   0% 13/56168 [00:28<30:24:10,  1.95s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1054) and continuation (1) exceeds model's maximum length (128). Truncating 928 tokens from the left.
Running loglikelihood requests:   0% 17/56168 [00:36<30:07:13,  1.93s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1036) and continuation (1) exceeds model's maximum length (128). Truncating 910 tokens from the left.
Running loglikelihood requests:   0% 21/56168 [00:42<28:26:27,  1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1034) and continuation (1) exceeds model's maximum length (128). Truncating 908 tokens from the left.
Running loglikelihood requests:   0% 25/56168 [00:50<28:50:16,  1.85s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1033) and continuation (1) exceeds model's maximum length (128). Truncating 907 tokens from the left.
Running loglikelihood requests:   0% 29/56168 [00:57<27:46:19,  1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1032) and continuation (1) exceeds model's maximum length (128). Truncating 906 tokens from the left.
Running loglikelihood requests:   0% 33/56168 [01:04<27:57:37,  1.79s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1030) and continuation (1) exceeds model's maximum length (128). Truncating 904 tokens from the left.
Running loglikelihood requests:   0% 37/56168 [01:10<27:19:47,  1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1023) and continuation (1) exceeds model's maximum length (128). Truncating 897 tokens from the left.
Running loglikelihood requests:   0% 41/56168 [01:18<27:36:22,  1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1018) and continuation (1) exceeds model's maximum length (128). Truncating 892 tokens from the left.
Running loglikelihood requests:   0% 45/56168 [01:24<27:05:10,  1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1012) and continuation (1) exceeds model's maximum length (128). Truncating 886 tokens from the left.
Running loglikelihood requests:   0% 49/56168 [01:31<27:11:46,  1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1008) and continuation (1) exceeds model's maximum length (128). Truncating 882 tokens from the left.
Running loglikelihood requests:   0% 53/56168 [01:39<27:42:16,  1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1006) and continuation (1) exceeds model's maximum length (128). Truncating 880 tokens from the left.
Running loglikelihood requests:   0% 57/56168 [01:45<26:59:19,  1.73s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1000) and continuation (1) exceeds model's maximum length (128). Truncating 874 tokens from the left.
Running loglikelihood requests:   0% 61/56168 [01:53<27:28:50,  1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (999) and continuation (1) exceeds model's maximum length (128). Truncating 873 tokens from the left.
Running loglikelihood requests:   0% 65/56168 [01:59<26:49:48,  1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (999) and continuation (1) exceeds model's maximum length (128). Truncating 873 tokens from the left.
Running loglikelihood requests:   0% 69/56168 [02:07<27:32:59,  1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (996) and continuation (1) exceeds model's maximum length (128). Truncating 870 tokens from the left.
Running loglikelihood requests:   0% 73/56168 [02:13<26:51:48,  1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (995) and continuation (1) exceeds model's maximum length (128). Truncating 869 tokens from the left.
Running loglikelihood requests:   0% 77/56168 [02:21<27:30:05,  1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (994) and continuation (1) exceeds model's maximum length (128). Truncating 868 tokens from the left.
Running loglikelihood requests:   0% 81/56168 [02:27<26:51:14,  1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (990) and continuation (1) exceeds model's maximum length (128). Truncating 864 tokens from the left.
Running loglikelihood requests:   0% 85/56168 [02:35<27:38:47,  1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (990) and continuation (1) exceeds model's maximum length (128). Truncating 864 tokens from the left.
Running loglikelihood requests:   0% 89/56168 [02:41<26:56:56,  1.73s/it]WARNING:lm_eval.models.huggingface:Combined length of context (989) and continuation (1) exceeds model's maximum length (128). Truncating 863 tokens from the left.
...

5-shot:

INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests:   0% 0/56168 [00:00<?, ?it/s]WARNING:lm_eval.models.huggingface:Combined length of context (3081) and continuation (1) exceeds model's maximum length (128). Truncating 2955 tokens from the left.
Running loglikelihood requests:   0% 1/56168 [00:08<135:57:17,  8.71s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3062) and continuation (1) exceeds model's maximum length (128). Truncating 2936 tokens from the left.
Running loglikelihood requests:   0% 5/56168 [00:15<42:45:07,  2.74s/it] WARNING:lm_eval.models.huggingface:Combined length of context (3049) and continuation (1) exceeds model's maximum length (128). Truncating 2923 tokens from the left.
Running loglikelihood requests:   0% 9/56168 [00:22<34:24:05,  2.21s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3033) and continuation (1) exceeds model's maximum length (128). Truncating 2907 tokens from the left.
Running loglikelihood requests:   0% 13/56168 [00:29<31:00:33,  1.99s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3027) and continuation (1) exceeds model's maximum length (128). Truncating 2901 tokens from the left.
Running loglikelihood requests:   0% 17/56168 [00:36<29:24:45,  1.89s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3006) and continuation (1) exceeds model's maximum length (128). Truncating 2880 tokens from the left.
Running loglikelihood requests:   0% 21/56168 [00:43<28:26:43,  1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2985) and continuation (1) exceeds model's maximum length (128). Truncating 2859 tokens from the left.
Running loglikelihood requests:   0% 25/56168 [00:49<27:48:19,  1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2981) and continuation (1) exceeds model's maximum length (128). Truncating 2855 tokens from the left.
Running loglikelihood requests:   0% 29/56168 [00:57<27:42:46,  1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2979) and continuation (1) exceeds model's maximum length (128). Truncating 2853 tokens from the left.
Running loglikelihood requests:   0% 33/56168 [01:03<27:13:25,  1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2973) and continuation (1) exceeds model's maximum length (128). Truncating 2847 tokens from the left.
Running loglikelihood requests:   0% 37/56168 [01:10<27:29:57,  1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2972) and continuation (1) exceeds model's maximum length (128). Truncating 2846 tokens from the left.
Running loglikelihood requests:   0% 41/56168 [01:17<26:50:56,  1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2972) and continuation (1) exceeds model's maximum length (128). Truncating 2846 tokens from the left.
Running loglikelihood requests:   0% 45/56168 [01:24<27:13:43,  1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2969) and continuation (1) exceeds model's maximum length (128). Truncating 2843 tokens from the left.
...
Mostelk commented 1 month ago

Let us check the mmlu_llama benchmark, 0- and 5-shot; we also need to decide on input & output sequence lengths.

Mostelk commented 1 month ago

How about we use perplexity to measure accuracy, similar to this ExecuTorch example for Llama 3.1 8B: using lm_eval, with settings similar to this example (max input sequence of 2048 and a limit of 1000), quoting from https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md: "We evaluated WikiText perplexity using LM Eval. Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000.

| Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
|---|---|---|---|
| Llama 3 8B | 7.9 | 9.4 | 9.7 |

"

freedomtan commented 1 month ago

evaluated with lm_eval --model hf --model_args pretrained=meta-llama/... --tasks mmlu --num_fewshot 5 on Colab (w/ L4 GPU)

| model | MMLU (5-shot) |
|---|---|
| 3.2 1B Instruct | 0.4557 ± 0.0041 |
| 3.2 3B Instruct | 0.5959 ± 0.0040 |
| 3.1 8B Instruct | 0.6820 ± 0.0037 |

evaluated with lm_eval --model hf --model_args pretrained=meta-llama/... --tasks mmlu_llama --num_fewshot 5 on Colab (w/ L4 GPU)

| model | MMLU (5-shot) |
|---|---|
| 3.2 1B Instruct | 0.4607 ± 0.0041 |
| 3.2 3B Instruct | 0.6173 ± 0.0039 |
| 3.1 8B Instruct | 0.6840 ± 0.0037 |

Mostelk commented 1 month ago

How about we use perplexity to measure accuracy, similar to this ExecuTorch example for Llama 3.1 8B: using lm_eval, with settings similar to this example (max input sequence of 2048 and a limit of 1000), quoting from https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md: "We evaluated WikiText perplexity using LM Eval. Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000.

| Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
|---|---|---|---|
| Llama 3 8B | 7.9 | 9.4 | 9.7 |

"

This reference also shows lm-eval perplexity numbers for the (non-instruct) 3.1 8B (seemingly with no constraint on context length as above) with different torchao quantizations: https://raw.githubusercontent.com/pytorch/ao/main/torchao/quantization/README.md

freedomtan commented 1 month ago

I’m not certain if summarization is the desired task for generative language models, and if perplexity is an appropriate metric.

freedomtan commented 1 month ago

@mohitmundhragithub @Mostelk

AhmedTElthakeb commented 1 month ago
lm-eval, perplexity on wikitext for meta-llama/Llama-3.1-8B-Instruct:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| wikitext | 2 | none | 0 | bits_per_byte | 0.5891 | ± N/A |
| | | none | 0 | byte_perplexity | 1.5043 | ± N/A |
| | | none | 0 | word_perplexity | 8.8784 | ± N/A |
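
For what it's worth, those three wikitext numbers are the same total negative log-likelihood normalized by different units, so they can be cross-checked against each other:

import math

byte_perplexity = 1.5043
word_perplexity = 8.8784

# bits_per_byte is just log2 of the per-byte perplexity.
print(round(math.log2(byte_perplexity), 4))  # ~0.5891, matching the reported value

# word_perplexity = byte_perplexity ** (bytes per word), so the implied average
# word length on WikiText (including the trailing space) is about 5.3 bytes.
print(round(math.log(word_perplexity) / math.log(byte_perplexity), 2))
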
Mostelk commented 1 month ago
  • let's check more on what WikiText perplexity is.
  • client group: 2k for both inputs and outputs for the previous submission, going for 4k now.
  • let's start with 2048/1024 for inputs/outputs. Let's try to report both MMLU and WikiText perplexity.

@mohitmundhragithub @Mostelk

mmlu_llama (not mmlu) & word perplexity (lm-eval)

freedomtan commented 1 month ago

aswin: please share the client working group MMLU evaluation script, so that we can use the same tool, and the numbers you shared with the client working group, please.

@AhmedTElthakeb mmlu_llama with 2K input length, 0-shot, 1-shot, and 5-shot.

Aswinoss commented 4 weeks ago

@freedomtan @Mostelk PFA, the link to the MMLU benchmark script folder used in client development:

mmlu client script

Things to note:

  1. The script is tuned to work for the MLPerf client application; we might have to tune it according to our app.
  2. Also, for the script to calculate the scores, the result dump should be in the MLPerf client result format (if not, the script needs to be changed).

Attaching sample output format of client for reference

results.json

Aswinoss commented 4 weeks ago

Regarding the model scores for the client benchmark, the submissions of all participants for 0.5 and 0.6 are listed here: results

The 1.0 submission deadline is this Friday (20/6), so the latest scores (for llama2, llama3.1-8B and phi-3.5) are not added to this link; they will be available after that date.

AhmedTElthakeb commented 4 weeks ago
Results for meta-llama/Llama-3.1-8B-Instruct with 2K context len:

==== 5 shot =====

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu_llama | 1 | strict_match | | exact_match | 0.6937 | ± 0.0037 |
| - humanities | 1 | strict_match | | exact_match | 0.6574 | ± 0.0067 |
| - other | 1 | strict_match | | exact_match | 0.7461 | ± 0.0075 |
| - social sciences | 1 | strict_match | | exact_match | 0.7862 | ± 0.0073 |
| - stem | 0 | strict_match | | exact_match | 0.6061 | ± 0.0084 |

==== 1 shot =====

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu_llama | 1 | strict_match | | exact_match | 0.6858 | ± 0.0037 |
| - humanities | 1 | strict_match | | exact_match | 0.6540 | ± 0.0067 |
| - other | 1 | strict_match | | exact_match | 0.7354 | ± 0.0075 |
| - social sciences | 1 | strict_match | | exact_match | 0.7784 | ± 0.0074 |
| - stem | 0 | strict_match | | exact_match | 0.5940 | ± 0.0084 |

==== 0 shot =====

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu_llama | 1 | strict_match | | exact_match | 0.6892 | ± 0.0037 |
| - humanities | 1 | strict_match | | exact_match | 0.6582 | ± 0.0067 |
| - other | 1 | strict_match | | exact_match | 0.7435 | ± 0.0075 |
| - social sciences | 1 | strict_match | | exact_match | 0.7748 | ± 0.0074 |
| - stem | 0 | strict_match | | exact_match | 0.5985 | ± 0.0084 |

Command:

lm_eval \
  --model vllm \
  --model_args pretrained=$1,dtype=auto,max_model_len=2048,max_gen_toks=10,tensor_parallel_size=1,enable_prefix_caching=True \
  --tasks mmlu_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto

Please check this PR https://github.com/EleutherAI/lm-evaluation-harness/pull/2797

lm-eval version: 0.4.8

Mostelk commented 4 weeks ago

@freedomtan @Mostelk PFA, the link of the mmlu benchmark script folder used in client development

mmlu client script

Things to note:

  1. The script is tuned to work for MLperf client application, we might have to tune it according to our app.
  2. Also, for the script to calculate the scores, the result dump should be in a MLperf client result format (if not, the script needs to be changed)

Attaching sample output format of client for reference

results.json

@Aswinoss please provide the link to the MMLU eval script from https://github.com/mlcommons/mlperf_client_dev

freedomtan commented 4 weeks ago

Results for meta-llama/Llama-3.1-8B-Instruct with 2K context len:

==== 5 shot =====

Groups Version Filter n-shot Metric Value Stderr
mmlu_llama 1 strict_match exact_match ↑ 0.6937 ± 0.0037
  • humanities 1 strict_match exact_match ↑ 0.6574 ± 0.0067
  • other 1 strict_match exact_match ↑ 0.7461 ± 0.0075
  • social sciences 1 strict_match exact_match ↑ 0.7862 ± 0.0073
  • stem 0 strict_match exact_match ↑ 0.6061 ± 0.0084

==== 1 shot =====

Groups Version Filter n-shot Metric Value Stderr
mmlu_llama 1 strict_match exact_match ↑ 0.6858 ± 0.0037
  • humanities 1 strict_match exact_match ↑ 0.6540 ± 0.0067
  • other 1 strict_match exact_match ↑ 0.7354 ± 0.0075
  • social sciences 1 strict_match exact_match ↑ 0.7784 ± 0.0074
  • stem 0 strict_match exact_match ↑ 0.5940 ± 0.0084

==== 0 shot =====

Groups Version Filter n-shot Metric Value Stderr
mmlu_llama 1 strict_match exact_match ↑ 0.6892 ± 0.0037
  • humanities 1 strict_match exact_match ↑ 0.6582 ± 0.0067
  • other 1 strict_match exact_match ↑ 0.7435 ± 0.0075
  • social sciences 1 strict_match exact_match ↑ 0.7748 ± 0.0074
  • stem 0 strict_match exact_match ↑ 0.5985 ± 0.0084

Command:

lm_eval \
  --model vllm \
  --model_args pretrained=$1,dtype=auto,max_model_len=2048,max_gen_toks=10,tensor_parallel_size=1,enable_prefix_caching=True \
  --tasks mmlu_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto

Please check this PR EleutherAI/lm-evaluation-harness#2797

lm-eval version: 0.4.8

--fewshot_as_multiturn is to provide few-shot examples as a multi-turn conversation, that is, to chop the examples into small chunks. We need to discuss if that's what we want to do.

Aswinoss commented 4 weeks ago

@freedomtan @Mostelk PFA, the link to the MMLU benchmark script folder used in client development
mmlu client script
Things to note:

  1. The script is tuned to work for MLperf client application, we might have to tune it according to our app.
  2. Also, for the script to calculate the scores, the result dump should be in a MLperf client result format (if not, the script needs to be changed)

Attaching sample output format of client for reference
results.json

@Aswinoss please provide the link to the MMLU eval script from https://github.com/mlcommons/mlperf_client_dev

My bad. This is the external link for the same: MMLU_client_dev

freedomtan commented 3 weeks ago

@freedomtan to find the exact mmlu_llama parameters used by lm_eval.

freedomtan commented 3 weeks ago

Let's check exactly what --fewshot_as_multiturn does and whether we want to have something like it.

It's basically that, for something like 5-shot, larger inputs are chopped into smaller chunks.

freedomtan commented 2 weeks ago

Let's check exactly what --fewshot_as_multiturn does and whether we want to have something like it.

It's basically that, for something like 5-shot, larger inputs are chopped into smaller chunks.

I read through the lm_eval code; it seems my understanding of --fewshot_as_multiturn was almost totally wrong. What it does:

  1. adds extra content markup (this could explicitly structure the input as "training"); a rough sketch follows this list.
  2. decreases shots if over the limit: if the content from (1) is longer than the token limit, it keeps decreasing the number of shots until it fits.
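
A rough illustration of point 1, assuming a chat-templated model: each shot becomes its own user/assistant turn instead of being pasted into one flat prompt. The exact markup lm_eval emits depends on the tokenizer's chat template, so treat this as a sketch only.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

shots = [
    ("Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:", "B"),
    ("Question: What is the capital of France?\nA. Rome\nB. Berlin\nC. Paris\nD. Madrid\nAnswer:", "C"),
]
target = "Question: Which is the largest planet?\nA. Earth\nB. Jupiter\nC. Mars\nD. Venus\nAnswer:"

# Each few-shot example becomes a user/assistant exchange; the real question is the last user turn.
messages = []
for question, answer in shots:
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": target})

print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
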
freedomtan commented 2 weeks ago

mmlu_llama

It seems the most important difference between mmlu and mmlu_llama is how the prompts are articulated.

freedomtan commented 2 weeks ago

mmlu_llama on device

fewshots as multiturn

lm_eval: that's fine. On-device: we need to implement them if we want to use them.

For the current non-on-device evaluation, we can simply use lm_eval + mmlu_llama with fewshot-as-multiturn (5-shot). Let's do 3.1-8B, 3.2-3B, and maybe 3.2-1B (the Instruct ones, because the base models need more instructions in the prompts). Let's try to report quantized numbers for these.

freedomtan commented 1 week ago

Convert PyTorch model -> quantized TFLite model (a non-standard one); it's possible to run these on an x86 machine, and we have a Python API -> .dla format.

(https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py#L49)

Q: PyTorch model -> ONNX (floating point) -> quantized QNN binary (not runnable on x86). How about lm_eval on Windows on ARM? Let's check if this works (@mohitmundhragithub and @Aswinoss).

Aswinoss commented 3 days ago

We were investigating this method. In our client app, the llama model runs using the GENIE SDK, which has the inference pipeline needed to run llama built in. The bins created after quantization are meant to be supported by this pipeline.

lm_eval (python) will not be able to support GENIE as GENIE provides only C++ APIs.

We shall discuss more on this in tomorrow's meeting.

Convert PyTorch model -> quantized TFLite model (a non-standard one); it's possible to run these on an x86 machine, and we have a Python API -> .dla format.

(https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py#L49)

Q: PyTorch model -> ONNX (floating point) -> quantized QNN binary (not runnable on x86). How about lm_eval on Windows on ARM? Let's check if this works (@mohitmundhragithub and @Aswinoss).
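
One possible workaround (a hypothetical sketch, not an existing tool): wrap the native runtime behind lm_eval's LM interface by shelling out to a small CLI built on the C++ APIs, so the Python harness keeps doing the task formatting and scoring. The binary name and flags below are placeholders, not a real GENIE utility.

import subprocess

from lm_eval.api.model import LM


class NativeRuntimeLM(LM):
    """Hypothetical lm_eval wrapper that delegates generation to a native binary."""

    def __init__(self, binary="./genie_runner", model_path="model.bin"):
        super().__init__()
        self.binary = binary          # placeholder CLI wrapping the C++ runtime
        self.model_path = model_path

    def generate_until(self, requests):
        outputs = []
        for request in requests:
            context, _gen_kwargs = request.args
            completed = subprocess.run(
                [self.binary, "--model", self.model_path, "--prompt", context],
                capture_output=True, text=True, check=True,
            )
            outputs.append(completed.stdout)
        return outputs

    def loglikelihood(self, requests):
        # Needed for multiple-choice tasks like mmlu; requires the runtime to
        # expose token log-probabilities, which it may not.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError

If this direction works, the wrapper could be passed straight to lm_eval.simple_evaluate(model=NativeRuntimeLM(), tasks=[...]), the same pre-initialized-model path that eval_llama uses.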

freedomtan commented 2 days ago
Mostelk commented 1 day ago

Results for meta-llama/Llama-3.1-8B-Instruct with 2K context len:

==== 5 shot =====

Groups Version Filter n-shot Metric Value Stderr
mmlu_llama 1 strict_match exact_match ↑ 0.6937 ± 0.0037
  • humanities 1 strict_match exact_match ↑ 0.6574 ± 0.0067
  • other 1 strict_match exact_match ↑ 0.7461 ± 0.0075
  • social sciences 1 strict_match exact_match ↑ 0.7862 ± 0.0073
  • stem 0 strict_match exact_match ↑ 0.6061 ± 0.0084

==== 1 shot =====

Groups Version Filter n-shot Metric Value Stderr
mmlu_llama 1 strict_match exact_match ↑ 0.6858 ± 0.0037
  • humanities 1 strict_match exact_match ↑ 0.6540 ± 0.0067
  • other 1 strict_match exact_match ↑ 0.7354 ± 0.0075
  • social sciences 1 strict_match exact_match ↑ 0.7784 ± 0.0074
  • stem 0 strict_match exact_match ↑ 0.5940 ± 0.0084

==== 0 shot =====

Groups Version Filter n-shot Metric Value Stderr
mmlu_llama 1 strict_match exact_match ↑ 0.6892 ± 0.0037
  • humanities 1 strict_match exact_match ↑ 0.6582 ± 0.0067
  • other 1 strict_match exact_match ↑ 0.7435 ± 0.0075
  • social sciences 1 strict_match exact_match ↑ 0.7748 ± 0.0074
  • stem 0 strict_match exact_match ↑ 0.5985 ± 0.0084

Command:

lm_eval \
  --model vllm \
  --model_args pretrained=$1,dtype=auto,max_model_len=2048,max_gen_toks=10,tensor_parallel_size=1,enable_prefix_caching=True \
  --tasks mmlu_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto

Please check this PR EleutherAI/lm-evaluation-harness#2797

lm-eval version: 0.4.8

Client working group used a 62 threshold for the client MMLU test.

  • Llama 2 7B - 43
  • Llama 3.1 8B Instruct - 62
  • Phi 3.5 Mini Instruct - 59
  • Phi 4 Reasoning 14B (Exp) - 70