freedomtan opened 2 months ago
In the MLPerf Client group, they have used the MMLU score for accuracy computation. They have a separate set of scripts that processes the output logs generated during inference to compute the MMLU score; it's an offline process. You are right that it takes several hours to run the whole suite. In that sense, tinyMMLU makes more sense.
The scripts that are used in the client group are shared here: https://github.com/mlcommons/mlperf_client_dev/tree/main/tools/mmlu
But, sadly, there is no public repository containing all the mmlu computation scripts.
One big problem I see is that the MMLU score doesn't measure the tasks used for performance, as it only measures the accuracy of the first token. Basically, we are not measuring the whole pipeline. Even in the client group, they are looking for other options.
In the inference group, the metric adopted is ROUGE-N. Maybe we can explore that as well.
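(For reference, ROUGE-N scores of this kind could be computed with, e.g., the `rouge-score` package; a minimal sketch, not the exact MLPerf inference harness code:)

```python
# Minimal sketch of a ROUGE-N check, assuming the `rouge-score` package
# (pip install rouge-score); not the exact MLPerf inference harness code.
from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
prediction = "A quick brown fox jumped over a lazy dog."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
for name, score in scores.items():
    # each entry carries precision, recall, and f-measure
    print(f"{name}: f1={score.fmeasure:.3f}")
```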
Let's check if we can run tinyBenchmarks (100 row validation sets maybe) with llama 3.2 1B and 3B and get acceptable numbers first.
@Mostelk @AhmedTElthakeb and @mohitmundhragithub
> Let's check if we can run tinyBenchmarks (100 row validation sets maybe) with llama 3.2 1B and 3B and get acceptable numbers first.
I think we discussed llama-3.1-8B and llama-3.2-3B in the meeting... right?
> Let's check if we can run tinyBenchmarks (100 row validation sets maybe) with llama 3.2 1B and 3B and get acceptable numbers first. @Mostelk @AhmedTElthakeb and @mohitmundhragithub
>
> I think we discussed llama-3.1-8B and llama-3.2-3B in the meeting... right?
right, my bad.
baseline numbers
llama 3.2 3B
with `lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B --tasks tinyBenchmarks`:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | ||||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.5055 | ± | N/A | |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.3170 | ± | N/A | |
 | | strict-match | 5 | exact_match | ↑ | 0.3170 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.7742 | ± | N/A | |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.5910 | ± | N/A | |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.4042 | ± | N/A | |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.6875 | ± | N/A |
llama 3.2 3B Instruct
with `lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct --tasks tinyBenchmarks`:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | ||||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.5670 | ± | N/A | |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.6233 | ± | N/A | |
 | | strict-match | 5 | exact_match | ↑ | 0.5994 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.7494 | ± | N/A | |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.6244 | ± | N/A | |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.4795 | ± | N/A | |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.6298 | ± | N/A |
llama 3.1 8B
with `lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B --tasks tinyBenchmarks`:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | ||||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.5733 | ± | N/A | |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.5037 | ± | N/A | |
 | | strict-match | 5 | exact_match | ↑ | 0.5037 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.8344 | ± | N/A | |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.6335 | ± | N/A | |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.4776 | ± | N/A | |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.7507 | ± | N/A |
llama 3.1 8B Instruct
with `lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks tinyBenchmarks`:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | |||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.6533 | ± | N/A |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.7610 | ± | N/A |
 | | strict-match | 5 | exact_match | ↑ | 0.7289 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.8054 | ± | N/A |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.6413 | ± | N/A |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.5409 | ± | N/A |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.7263 | ± | N/A |
gemma 3 1b
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | |||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.4571 | ± | N/A |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.2647 | ± | N/A |
 | | strict-match | 5 | exact_match | ↑ | 0.2514 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.5312 | ± | N/A |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.4935 | ± | N/A |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.3836 | ± | N/A |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.6210 | ± | N/A |
gemma 3 1b qat q4_0 unquantized
(https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-unquantized)
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | |||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.4125 | ± | N/A |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.1290 | ± | N/A |
 | | strict-match | 5 | exact_match | ↑ | 0.1290 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.5563 | ± | N/A |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.4550 | ± | N/A |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.3881 | ± | N/A |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.6292 | ± | N/A |
gemma 3 4b
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | |||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.6002 | ± | N/A |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.7760 | ± | N/A |
 | | strict-match | 5 | exact_match | ↑ | 0.7760 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.7045 | ± | N/A |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.6005 | ± | N/A |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.4619 | ± | N/A |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.6961 | ± | N/A |
gemma 3 4b qat q4_0 unquantized
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
tinyBenchmarks | N/A | |||||||
- tinyArc | 0 | none | 25 | acc_norm | ↑ | 0.6137 | ± | N/A |
- tinyGSM8k | 0 | flexible-extract | 5 | exact_match | ↑ | 0.7574 | ± | N/A |
 | | strict-match | 5 | exact_match | ↑ | 0.7574 | ± | N/A
- tinyHellaswag | 0 | none | 10 | acc_norm | ↑ | 0.6894 | ± | N/A |
- tinyMMLU | 0 | none | 0 | acc_norm | ↑ | 0.6036 | ± | N/A |
- tinyTruthfulQA | 0 | none | 0 | acc | ↑ | 0.4426 | ± | N/A |
- tinyWinogrande | 0 | none | 5 | acc_norm | ↑ | 0.7046 | ± | N/A |
Let us try to quantize these and report accuracy: Llama 3.1 8B Instruct and Llama 3.2 3B Instruct.
After some exploration, the use cases we are trying to enable (say, summarization, context generation, etc.) are not properly captured by the datasets used in tinyBenchmarks. Most of the datasets have single-token outputs (except for GSM8k and AlpacaEval), and none of them captures the summarization task in particular.
@Mostelk should we still go with the above task of quantization and accuracy collection for these datasets?
We are also looking into the Open-Orca dataset, which captures all the tasks we discussed in the last meeting.
We are also looking at whether the ROUGE-N score could be a viable accuracy metric for the Open-Orca dataset.
@Aswinoss I am open to other benchmarks. However, I don't agree that the benchmarks used by tinyBenchmarks have single-token outputs. They do not.
@freedomtan From the tinyBenchmarks page, the ones I marked as single-token have descriptions that suggest single-token outputs (please find below). We have not run them to be exactly sure, but it looks like the mentioned benchmarks won't have a stream of text or a content stream as output (except for GSM8k and AlpacaEval).
For example, tinyTruthfulQA describes its input/output in this way: we ask a question and the LLM responds with a label / label list, where the true option is 1 and the rest are all 0s.
One more thing: we are also okay with any of these accuracy benchmarks. But based on the discussion we had in the last engineering meeting, we thought that these datasets may not be representative enough.
@Aswinoss it's quite easy to check the outputs of those benchmarks. For example, we can use `--output_path` and `--log_samples`, as described at https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#saving--caching-results, and then check those outputs. The example you show is more of an internal representation of the dataset.
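(As an illustration, the dumped samples can then be inspected with a few lines of Python; a rough sketch, the file name below is only a placeholder since the exact layout under the output path depends on the lm-eval version:)

```python
# Rough sketch: inspect samples dumped by
#   lm_eval ... --output_path results/ --log_samples
# The file name below is a placeholder; the exact layout under results/
# depends on the lm-eval version.
import json

samples_path = "results/samples_tinyMMLU.jsonl"  # placeholder path
with open(samples_path) as f:
    first = json.loads(f.readline())

# Typical fields include the source doc, the target, and the model responses.
for key in ("doc", "target", "resps", "filtered_resps"):
    print(key, "->", first.get(key))
```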
How about the quantized models from the Meta folks? We know they are available on Hugging Face too.
Well, they are not in the Hugging Face safetensors format, that is, we can't evaluate them with `lm_eval --model hf --model_args pretrained=...`. It seems to me that we have to use ExecuTorch (see https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md).
@mohitmundhragithub @Aswinoss and @Mostelk OpenOrca is a dataset, not a benchmark.
@mohitmundhragithub @freedomtan I agree, let us consider other metrics as well, but I prefer a metric that is publicly published by Meta. For example, I see here that they use SQuAD for reading comprehension. Also, we need to decide whether to use the general model or the instruction-tuned model; it looks like the reading-comprehension results below are for the original general model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
Given we used SQuAD before with MobileBERT, I don't think it should be hard here either. Does Meta report numbers for a summarization task?
Reading comprehension results (as in the Llama model card):
Benchmark | #shots | Metric | | | | | |
---|---|---|---|---|---|---|---
SQuAD | 1 | em | 76.4 | 77.0 | 85.6 | 81.8 | 89.3
QuAC (F1) | 1 | f1 | 44.4 | 44.9 | 51.1 | 51.1 | 53.6
BoolQ | 0 | acc_char | 75.7 | 75.0 | 79.0 | 79.4 | 80.0
DROP (F1) | 3 | f1 | 58.4 | 59.5 | 79.7 | 79.6 | 84.8

@mohitmundhragithub @freedomtan please check these 3.1 8B quantized models, with their evaluation; it looks like they use lm-eval, so maybe we adopt some tiny lm-eval (it appears to use static weight quantization but dynamic activation quantization):
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16
> @mohitmundhragithub @Aswinoss and @Mostelk OpenOrca is a dataset, not a benchmark.
Agreed, it's a dataset. The corresponding accuracy metric is ROUGE-N.
This paper (https://arxiv.org/pdf/2208.03299) also has an interesting code base that may be easier to integrate than lm-eval or tiny lm-eval; just focus on the zero-shot cases for our purposes: https://github.com/facebookresearch/atlas?tab=readme-ov-file#tasks , e.g. for MMLU https://github.com/facebookresearch/atlas/blob/main/example_scripts/mmlu/README_MMLU.md
Why do we need that tool? Running either MMLU or tinyMMLU is quite straightforward: simply feed formatted inputs into the model, and we can obtain reasonable results.
Agreed; it looks like it does exactly that.
> Let us try to quantize these and report accuracy: Llama 3.1 8B Instruct and Llama 3.2 3B Instruct.
We will use MMLU (5 shot) to report accuracies after quantizing these models @mohitmundhragithub @AhmedTElthakeb @freedomtan
As I said, Meta's quantized Llama 3.2 3B models can be evaluated with the ExecuTorch code. With
export LLAMA_DIR="/Users/freedom/.llama/checkpoints"
export LLAMA_QUANTIZED_CHECKPOINT=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/consolidated.00.pth"
export LLAMA_PARAMS=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/params.json"
export LLAMA_TOKENIZER=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/tokenizer.model"
python -m executorch.examples.models.llama.eval_llama \
--model "llama3_2" -qat -lora 16 \
--preq_mode 8da4w_output_8da8w \
--preq_group_size 32 \
--preq_embedding_quantize 8,0 \
--use_sdpa_with_kv_cache -kv \
-X --xnnpack-extended-ops \
-d fp32 \
--max_seq_length 8192 --max_context_length 8192 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
-c "${LLAMA_QUANTIZED_CHECKPOINT:?}" -p "${LLAMA_PARAMS:?}" -t "${LLAMA_TOKENIZER}" \
--tasks tinyMMLU --num_fewshot 5
Note that tinyMMLU's few-shot number is set to zero; we had to change it to make `--num_fewshot 5` work.
With that, I got
INFO:lm_eval.evaluator:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm_eval.evaluator:Using pre-initialized model
WARNING:lm_eval.evaluator:Overwriting default num_fewshot of tinyMMLU from 1 to 5
INFO:lm_eval.api.task:Building contexts for tinyMMLU on rank 0...
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1748.27it/s]
INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests: 100%|█████████| 400/400 [17:28<00:00, 2.62s/it]
tinyMMLU: {'alias': 'tinyMMLU', 'acc_norm,none': np.float64(0.5882945804372754), 'acc_norm_stderr,none': 'N/A'}
on a Mac.
Note that it's also possible to evaluate ExecuTorch .pte model accuracy with python -m executorch.examples.models.llama.eval_llama --pte ....
To get baseline numbers for Llama 3.2 3B Instruct MMLU with lm_eval,
$ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct --tasks mmlu --num_fewshot 5
I got
hf (pretrained=meta-llama/Llama-3.2-3B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
mmlu | 2 | none | | acc | ↑ | 0.5959 | ± | 0.0040
- humanities | 2 | none | | acc | ↑ | 0.5637 | ± | 0.0070
- formal_logic | 1 | none | 5 | acc | ↑ | 0.3968 | ± | 0.0438 | |
- high_school_european_history | 1 | none | 5 | acc | ↑ | 0.7697 | ± | 0.0329 | |
- high_school_us_history | 1 | none | 5 | acc | ↑ | 0.7451 | ± | 0.0306 | |
- high_school_world_history | 1 | none | 5 | acc | ↑ | 0.7806 | ± | 0.0269 | |
- international_law | 1 | none | 5 | acc | ↑ | 0.6860 | ± | 0.0424 | |
- jurisprudence | 1 | none | 5 | acc | ↑ | 0.6944 | ± | 0.0445 | |
- logical_fallacies | 1 | none | 5 | acc | ↑ | 0.7055 | ± | 0.0358 | |
- moral_disputes | 1 | none | 5 | acc | ↑ | 0.6879 | ± | 0.0249 | |
- moral_scenarios | 1 | none | 5 | acc | ↑ | 0.4425 | ± | 0.0166 | |
- philosophy | 1 | none | 5 | acc | ↑ | 0.6624 | ± | 0.0269 | |
- prehistory | 1 | none | 5 | acc | ↑ | 0.6481 | ± | 0.0266 | |
- professional_law | 1 | none | 5 | acc | ↑ | 0.4524 | ± | 0.0127 | |
- world_religions | 1 | none | 5 | acc | ↑ | 0.7076 | ± | 0.0349 | |
- other | 2 | none | | acc | ↑ | 0.6505 | ± | 0.0083
- business_ethics | 1 | none | 5 | acc | ↑ | 0.5700 | ± | 0.0498 | |
- clinical_knowledge | 1 | none | 5 | acc | ↑ | 0.6604 | ± | 0.0291 | |
- college_medicine | 1 | none | 5 | acc | ↑ | 0.5549 | ± | 0.0379 | |
- global_facts | 1 | none | 5 | acc | ↑ | 0.3200 | ± | 0.0469 | |
- human_aging | 1 | none | 5 | acc | ↑ | 0.6323 | ± | 0.0324 | |
- management | 1 | none | 5 | acc | ↑ | 0.7087 | ± | 0.0450 | |
- marketing | 1 | none | 5 | acc | ↑ | 0.8547 | ± | 0.0231 | |
- medical_genetics | 1 | none | 5 | acc | ↑ | 0.7500 | ± | 0.0435 | |
- miscellaneous | 1 | none | 5 | acc | ↑ | 0.7292 | ± | 0.0159 | |
- nutrition | 1 | none | 5 | acc | ↑ | 0.6732 | ± | 0.0269 | |
- professional_accounting | 1 | none | 5 | acc | ↑ | 0.4645 | ± | 0.0298 | |
- professional_medicine | 1 | none | 5 | acc | ↑ | 0.7059 | ± | 0.0277 | |
- virology | 1 | none | 5 | acc | ↑ | 0.4337 | ± | 0.0386 | |
- social sciences | 2 | none | | acc | ↑ | 0.6796 | ± | 0.0082
- econometrics | 1 | none | 5 | acc | ↑ | 0.4035 | ± | 0.0462 | |
- high_school_geography | 1 | none | 5 | acc | ↑ | 0.7424 | ± | 0.0312 | |
- high_school_government_and_politics | 1 | none | 5 | acc | ↑ | 0.8031 | ± | 0.0287 | |
- high_school_macroeconomics | 1 | none | 5 | acc | ↑ | 0.5513 | ± | 0.0252 | |
- high_school_microeconomics | 1 | none | 5 | acc | ↑ | 0.6471 | ± | 0.0310 | |
- high_school_psychology | 1 | none | 5 | acc | ↑ | 0.7853 | ± | 0.0176 | |
- human_sexuality | 1 | none | 5 | acc | ↑ | 0.7099 | ± | 0.0398 | |
- professional_psychology | 1 | none | 5 | acc | ↑ | 0.6062 | ± | 0.0198 | |
- public_relations | 1 | none | 5 | acc | ↑ | 0.6636 | ± | 0.0453 | |
- security_studies | 1 | none | 5 | acc | ↑ | 0.6939 | ± | 0.0295 | |
- sociology | 1 | none | 5 | acc | ↑ | 0.7910 | ± | 0.0287 | |
- us_foreign_policy | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0402 | |
- stem | 2 | none | | acc | ↑ | 0.5084 | ± | 0.0086
- abstract_algebra | 1 | none | 5 | acc | ↑ | 0.2900 | ± | 0.0456 | |
- anatomy | 1 | none | 5 | acc | ↑ | 0.5778 | ± | 0.0427 | |
- astronomy | 1 | none | 5 | acc | ↑ | 0.6513 | ± | 0.0388 | |
- college_biology | 1 | none | 5 | acc | ↑ | 0.7222 | ± | 0.0375 | |
- college_chemistry | 1 | none | 5 | acc | ↑ | 0.4600 | ± | 0.0501 | |
- college_computer_science | 1 | none | 5 | acc | ↑ | 0.5700 | ± | 0.0498 | |
- college_mathematics | 1 | none | 5 | acc | ↑ | 0.3000 | ± | 0.0461 | |
- college_physics | 1 | none | 5 | acc | ↑ | 0.3824 | ± | 0.0484 | |
- computer_security | 1 | none | 5 | acc | ↑ | 0.6400 | ± | 0.0482 | |
- conceptual_physics | 1 | none | 5 | acc | ↑ | 0.5106 | ± | 0.0327 | |
- electrical_engineering | 1 | none | 5 | acc | ↑ | 0.6000 | ± | 0.0408 | |
- elementary_mathematics | 1 | none | 5 | acc | ↑ | 0.4471 | ± | 0.0256 | |
- high_school_biology | 1 | none | 5 | acc | ↑ | 0.7194 | ± | 0.0256 | |
- high_school_chemistry | 1 | none | 5 | acc | ↑ | 0.5320 | ± | 0.0351 | |
- high_school_computer_science | 1 | none | 5 | acc | ↑ | 0.6000 | ± | 0.0492 | |
- high_school_mathematics | 1 | none | 5 | acc | ↑ | 0.3519 | ± | 0.0291 | |
- high_school_physics | 1 | none | 5 | acc | ↑ | 0.3709 | ± | 0.0394 | |
- high_school_statistics | 1 | none | 5 | acc | ↑ | 0.4537 | ± | 0.0340 | |
- machine_learning | 1 | none | 5 | acc | ↑ | 0.3661 | ± | 0.0457 |
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
mmlu | 2 | none | | acc | ↑ | 0.5959 | ± | 0.0040
- humanities | 2 | none | | acc | ↑ | 0.5637 | ± | 0.0070
- other | 2 | none | | acc | ↑ | 0.6505 | ± | 0.0083
- social sciences | 2 | none | | acc | ↑ | 0.6796 | ± | 0.0082
- stem | 2 | none | | acc | ↑ | 0.5084 | ± | 0.0086
with colab + L4 GPU
How does ExecuTorch's executorch.examples.models.llama.eval_llama work?
Mainly, it calls lm_eval's evaluator.simple_evaluate(); see https://github.com/pytorch/executorch/blob/main/examples/models/llama/eval_llama_lib.py#L295-L320 and https://github.com/EleutherAI/lm-evaluation-harness/blob/8bc4afff22e73995883de41018388428e39f8a92/lm_eval/evaluator.py#L47
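(In other words, the CLI and the eval_llama wrapper boil down to roughly the following; a sketch, assuming lm-eval 0.4.x's Python API:)

```python
# Rough sketch of what the lm_eval CLI / eval_llama wrappers boil down to,
# assuming lm-eval 0.4.x's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # eval_llama passes its own wrapped eager/ExecuTorch model here instead
    model_args="pretrained=meta-llama/Llama-3.2-3B-Instruct",
    tasks=["tinyMMLU"],
    num_fewshot=5,
)
print(results["results"]["tinyMMLU"])
```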
evaluated with lm_eval --model hf --model_args pretrained=meta-llama/... --tasks mmlu --num_fewshot 5
on Colab (w/ L4 GPU)
model | MMLU (5-shot) |
---|---|
3.2 1B Instruct | 0.4557 ± 0.0041 |
3.2 3B Instruct | 0.5959 ±0.0040 |
3.1 8B Instruct | 0.6820 ± 0.0037 |
MMLU (0-shot, 5-shot):
running MMLU with lm_eval
num_fewshot | context length |
---|---|
0 | < 1024 |
1 | could > 1024 |
5 | could > 3072 |
with something like
python -m executorch.examples.models.llama.eval_llama -c "${LLAMA_CHECKPOINT:?}" -p "${LLAMA_PARAMS:?}" -t "${LLAMA_TOKENIZER}" -kv -d bf16 --tasks mmlu --num_fewshot ${NUM_FEWSHOT}
0-shot:
INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests: 0% 0/56168 [00:00<?, ?it/s]WARNING:lm_eval.models.huggingface:Combined length of context (984) and continuation (1) exceeds model's maximum length (128). Truncating 858 tokens from the left.
Running loglikelihood requests: 0% 1/56168 [00:07<115:14:52, 7.39s/it]WARNING:lm_eval.models.huggingface:Combined length of context (973) and continuation (1) exceeds model's maximum length (128). Truncating 847 tokens from the left.
Running loglikelihood requests: 0% 5/56168 [00:14<41:25:25, 2.66s/it] WARNING:lm_eval.models.huggingface:Combined length of context (968) and continuation (1) exceeds model's maximum length (128). Truncating 842 tokens from the left.
Running loglikelihood requests: 0% 9/56168 [00:21<33:07:32, 2.12s/it]WARNING:lm_eval.models.huggingface:Combined length of context (945) and continuation (1) exceeds model's maximum length (128). Truncating 819 tokens from the left.
Running loglikelihood requests: 0% 13/56168 [00:28<31:07:29, 2.00s/it]WARNING:lm_eval.models.huggingface:Combined length of context (944) and continuation (1) exceeds model's maximum length (128). Truncating 818 tokens from the left.
Running loglikelihood requests: 0% 17/56168 [00:35<29:52:43, 1.92s/it]WARNING:lm_eval.models.huggingface:Combined length of context (758) and continuation (1) exceeds model's maximum length (128). Truncating 632 tokens from the left.
Running loglikelihood requests: 0% 21/56168 [00:42<28:23:05, 1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (743) and continuation (1) exceeds model's maximum length (128). Truncating 617 tokens from the left.
Running loglikelihood requests: 0% 25/56168 [00:50<29:02:31, 1.86s/it]WARNING:lm_eval.models.huggingface:Combined length of context (741) and continuation (1) exceeds model's maximum length (128). Truncating 615 tokens from the left.
Running loglikelihood requests: 0% 29/56168 [00:56<27:53:27, 1.79s/it]WARNING:lm_eval.models.huggingface:Combined length of context (739) and continuation (1) exceeds model's maximum length (128). Truncating 613 tokens from the left.
Running loglikelihood requests: 0% 33/56168 [01:04<28:21:21, 1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (737) and continuation (1) exceeds model's maximum length (128). Truncating 611 tokens from the left.
Running loglikelihood requests: 0% 37/56168 [01:10<27:27:25, 1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (688) and continuation (1) exceeds model's maximum length (128). Truncating 562 tokens from the left.
Running loglikelihood requests: 0% 41/56168 [01:18<27:59:45, 1.80s/it]WARNING:lm_eval.models.huggingface:Combined length of context (685) and continuation (1) exceeds model's maximum length (128). Truncating 559 tokens from the left.
Running loglikelihood requests: 0% 45/56168 [01:24<27:13:56, 1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (673) and continuation (1) exceeds model's maximum length (128). Truncating 547 tokens from the left.
Running loglikelihood requests: 0% 49/56168 [01:32<27:58:14, 1.79s/it]WARNING:lm_eval.models.huggingface:Combined length of context (666) and continuation (1) exceeds model's maximum length (128). Truncating 540 tokens from the left.
Running loglikelihood requests: 0% 53/56168 [01:39<27:15:06, 1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (664) and continuation (1) exceeds model's maximum length (128). Truncating 538 tokens from the left.
Running loglikelihood requests: 0% 57/56168 [01:46<27:34:17, 1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (653) and continuation (1) exceeds model's maximum length (128). Truncating 527 tokens from the left.
Running loglikelihood requests: 0% 61/56168 [01:53<27:07:42, 1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (645) and continuation (1) exceeds model's maximum length (128). Truncating 519 tokens from the left.
Running loglikelihood requests: 0% 65/56168 [02:00<27:21:57, 1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (644) and continuation (1) exceeds model's maximum length (128). Truncating 518 tokens from the left.
Running loglikelihood requests: 0% 69/56168 [02:07<27:09:36, 1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (644) and continuation (1) exceeds model's maximum length (128). Truncating 518 tokens from the left.
....
1-shot:
INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests: 0% 0/56168 [00:00<?, ?it/s]WARNING:lm_eval.models.huggingface:Combined length of context (1108) and continuation (1) exceeds model's maximum length (128). Truncating 982 tokens from the left.
Running loglikelihood requests: 0% 1/56168 [00:08<129:48:35, 8.32s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1089) and continuation (1) exceeds model's maximum length (128). Truncating 963 tokens from the left.
Running loglikelihood requests: 0% 5/56168 [00:14<40:53:06, 2.62s/it] WARNING:lm_eval.models.huggingface:Combined length of context (1076) and continuation (1) exceeds model's maximum length (128). Truncating 950 tokens from the left.
Running loglikelihood requests: 0% 9/56168 [00:22<34:21:21, 2.20s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1060) and continuation (1) exceeds model's maximum length (128). Truncating 934 tokens from the left.
Running loglikelihood requests: 0% 13/56168 [00:28<30:24:10, 1.95s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1054) and continuation (1) exceeds model's maximum length (128). Truncating 928 tokens from the left.
Running loglikelihood requests: 0% 17/56168 [00:36<30:07:13, 1.93s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1036) and continuation (1) exceeds model's maximum length (128). Truncating 910 tokens from the left.
Running loglikelihood requests: 0% 21/56168 [00:42<28:26:27, 1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1034) and continuation (1) exceeds model's maximum length (128). Truncating 908 tokens from the left.
Running loglikelihood requests: 0% 25/56168 [00:50<28:50:16, 1.85s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1033) and continuation (1) exceeds model's maximum length (128). Truncating 907 tokens from the left.
Running loglikelihood requests: 0% 29/56168 [00:57<27:46:19, 1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1032) and continuation (1) exceeds model's maximum length (128). Truncating 906 tokens from the left.
Running loglikelihood requests: 0% 33/56168 [01:04<27:57:37, 1.79s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1030) and continuation (1) exceeds model's maximum length (128). Truncating 904 tokens from the left.
Running loglikelihood requests: 0% 37/56168 [01:10<27:19:47, 1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1023) and continuation (1) exceeds model's maximum length (128). Truncating 897 tokens from the left.
Running loglikelihood requests: 0% 41/56168 [01:18<27:36:22, 1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1018) and continuation (1) exceeds model's maximum length (128). Truncating 892 tokens from the left.
Running loglikelihood requests: 0% 45/56168 [01:24<27:05:10, 1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1012) and continuation (1) exceeds model's maximum length (128). Truncating 886 tokens from the left.
Running loglikelihood requests: 0% 49/56168 [01:31<27:11:46, 1.74s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1008) and continuation (1) exceeds model's maximum length (128). Truncating 882 tokens from the left.
Running loglikelihood requests: 0% 53/56168 [01:39<27:42:16, 1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1006) and continuation (1) exceeds model's maximum length (128). Truncating 880 tokens from the left.
Running loglikelihood requests: 0% 57/56168 [01:45<26:59:19, 1.73s/it]WARNING:lm_eval.models.huggingface:Combined length of context (1000) and continuation (1) exceeds model's maximum length (128). Truncating 874 tokens from the left.
Running loglikelihood requests: 0% 61/56168 [01:53<27:28:50, 1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (999) and continuation (1) exceeds model's maximum length (128). Truncating 873 tokens from the left.
Running loglikelihood requests: 0% 65/56168 [01:59<26:49:48, 1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (999) and continuation (1) exceeds model's maximum length (128). Truncating 873 tokens from the left.
Running loglikelihood requests: 0% 69/56168 [02:07<27:32:59, 1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (996) and continuation (1) exceeds model's maximum length (128). Truncating 870 tokens from the left.
Running loglikelihood requests: 0% 73/56168 [02:13<26:51:48, 1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (995) and continuation (1) exceeds model's maximum length (128). Truncating 869 tokens from the left.
Running loglikelihood requests: 0% 77/56168 [02:21<27:30:05, 1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (994) and continuation (1) exceeds model's maximum length (128). Truncating 868 tokens from the left.
Running loglikelihood requests: 0% 81/56168 [02:27<26:51:14, 1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (990) and continuation (1) exceeds model's maximum length (128). Truncating 864 tokens from the left.
Running loglikelihood requests: 0% 85/56168 [02:35<27:38:47, 1.77s/it]WARNING:lm_eval.models.huggingface:Combined length of context (990) and continuation (1) exceeds model's maximum length (128). Truncating 864 tokens from the left.
Running loglikelihood requests: 0% 89/56168 [02:41<26:56:56, 1.73s/it]WARNING:lm_eval.models.huggingface:Combined length of context (989) and continuation (1) exceeds model's maximum length (128). Truncating 863 tokens from the left.
...
5-shot:
INFO:lm_eval.evaluator:Running loglikelihood requests
Running loglikelihood requests: 0% 0/56168 [00:00<?, ?it/s]WARNING:lm_eval.models.huggingface:Combined length of context (3081) and continuation (1) exceeds model's maximum length (128). Truncating 2955 tokens from the left.
Running loglikelihood requests: 0% 1/56168 [00:08<135:57:17, 8.71s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3062) and continuation (1) exceeds model's maximum length (128). Truncating 2936 tokens from the left.
Running loglikelihood requests: 0% 5/56168 [00:15<42:45:07, 2.74s/it] WARNING:lm_eval.models.huggingface:Combined length of context (3049) and continuation (1) exceeds model's maximum length (128). Truncating 2923 tokens from the left.
Running loglikelihood requests: 0% 9/56168 [00:22<34:24:05, 2.21s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3033) and continuation (1) exceeds model's maximum length (128). Truncating 2907 tokens from the left.
Running loglikelihood requests: 0% 13/56168 [00:29<31:00:33, 1.99s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3027) and continuation (1) exceeds model's maximum length (128). Truncating 2901 tokens from the left.
Running loglikelihood requests: 0% 17/56168 [00:36<29:24:45, 1.89s/it]WARNING:lm_eval.models.huggingface:Combined length of context (3006) and continuation (1) exceeds model's maximum length (128). Truncating 2880 tokens from the left.
Running loglikelihood requests: 0% 21/56168 [00:43<28:26:43, 1.82s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2985) and continuation (1) exceeds model's maximum length (128). Truncating 2859 tokens from the left.
Running loglikelihood requests: 0% 25/56168 [00:49<27:48:19, 1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2981) and continuation (1) exceeds model's maximum length (128). Truncating 2855 tokens from the left.
Running loglikelihood requests: 0% 29/56168 [00:57<27:42:46, 1.78s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2979) and continuation (1) exceeds model's maximum length (128). Truncating 2853 tokens from the left.
Running loglikelihood requests: 0% 33/56168 [01:03<27:13:25, 1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2973) and continuation (1) exceeds model's maximum length (128). Truncating 2847 tokens from the left.
Running loglikelihood requests: 0% 37/56168 [01:10<27:29:57, 1.76s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2972) and continuation (1) exceeds model's maximum length (128). Truncating 2846 tokens from the left.
Running loglikelihood requests: 0% 41/56168 [01:17<26:50:56, 1.72s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2972) and continuation (1) exceeds model's maximum length (128). Truncating 2846 tokens from the left.
Running loglikelihood requests: 0% 45/56168 [01:24<27:13:43, 1.75s/it]WARNING:lm_eval.models.huggingface:Combined length of context (2969) and continuation (1) exceeds model's maximum length (128). Truncating 2843 tokens from the left.
...
Let us check the mmlu_llama benchmark, 0- and 5-shot. We also need to decide on input & output sequence lengths.
How about we use perplexity to measure accuracy, similar to this ExecuTorch example for Llama 3.1 8B: using lm_eval with similar settings to that example (max input sequence of 2048 and a limit of 1000). Quoting from https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md: "We evaluated WikiText perplexity using LM Eval. Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000.
Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
---|---|---|---|
Llama 3 8B | 7.9 | 9.4 | 9.7 |
"
> evaluated with `lm_eval --model hf --model_args pretrained=meta-llama/... --tasks mmlu --num_fewshot 5` on Colab (w/ L4 GPU)
>
> model | MMLU (5-shot)
> ---|---
> 3.2 1B Instruct | 0.4557 ± 0.0041
> 3.2 3B Instruct | 0.5959 ± 0.0040
> 3.1 8B Instruct | 0.6820 ± 0.0037
evaluated with lm_eval --model hf --model_args pretrained=meta-llama/... --tasks mmlu_llama --num_fewshot 5
on Colab (w/ L4 GPU)
model | MMLU (5-shot) |
---|---|
3.2 1B Instruct | 0.4607 ± 0.0041 |
3.2 3B Instruct | 0.6173 ± 0.0039 |
3.1 8B Instruct | 0.6840 ± 0.0037 |
How about we use perplexity to measure the accuracy, similar to this ExecuTorch example for Llama 3.1 8B: using LM_EVAL, and using similar settings in this example of max input sequence of 2048, and output of 1000 as them "quoting from here https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md" : "We evaluated WikiText perplexity using LM Eval. Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000.
Model Baseline (FP32) Groupwise 4-bit (128) Groupwise 4-bit (256) Llama 3 8B 7.9 9.4 9.7 "
This reference also shows lm-eval perplexity numbers for the (non-instruct) 3.1 8B (seemingly with no constraint on context length as above) with different torchao quantizations: https://raw.githubusercontent.com/pytorch/ao/main/torchao/quantization/README.md
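(If we go that way, a rough sketch of weight-only quantizing a checkpoint with torchao and scoring WikiText perplexity through lm-eval's Python API; it assumes torchao's `quantize_`/`int8_weight_only` helpers as in the linked README, and exact names/arguments may differ across versions:)

```python
# Rough sketch: weight-only quantization with torchao, then wikitext perplexity
# via lm-eval's Python API. Assumes torchao's quantize_/int8_weight_only helpers
# as in the linked README; exact names/arguments may differ across versions.
import torch
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_weight_only

model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_(model, int8_weight_only())  # in-place weight-only int8 quantization

lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=1)
results = lm_eval.simple_evaluate(model=lm, tasks=["wikitext"], limit=1000)
print(results["results"]["wikitext"])
```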
I'm not certain whether summarization is the desired task for generative language models, or whether perplexity is an appropriate metric.
@mohitmundhragithub @Mostelk
lm-eval, perplexity on wikitext for meta-llama/Llama-3.1-8B-Instruct:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
wikitext | 2 | none | 0 | bits_per_byte | ↓ | 0.5891 | ± | N/A
 | | none | 0 | byte_perplexity | ↓ | 1.5043 | ± | N/A
 | | none | 0 | word_perplexity | ↓ | 8.8784 | ± | N/A
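(For reference, these three numbers are different views of the same loglikelihood, assuming lm-eval's usual definitions: byte_perplexity = 2^bits_per_byte, and word_perplexity is the same quantity re-exponentiated per word instead of per byte. A quick sanity check on the values above:)

```python
# Sanity check on the relationship between the three wikitext metrics above,
# assuming lm-eval's usual definitions.
import math

bits_per_byte = 0.5891
byte_perplexity = 1.5043
word_perplexity = 8.8784

print(2 ** bits_per_byte)  # ~1.504, i.e. byte_perplexity == 2**bits_per_byte
# implied average bytes per word on this corpus (ratio of the log-perplexities):
print(math.log(word_perplexity) / math.log(byte_perplexity))  # ~5.3
```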
- let's check more on what WikiText perplexity is.
- client group: 2k for both inputs and outputs in the previous submission, going for 4k now.
- let's start with 2048/1024 for inputs/outputs. Let's try to report both MMLU and WikiText perplexity.

mmlu_llama (not mmlu) & word perplexity (lm-eval)
aswin: please share the client working group's MMLU evaluation script, so that we can have the same tool, and the numbers you shared with the client working group, please.
@AhmedTElthakeb mmlu-llama with 2K input length, 0-shot, 1-shot, and 5-shot.
@freedomtan @Mostelk PFA the link to the MMLU benchmark script folder used in client development.
Things to note: 1) The script is tuned to work with the MLPerf client application; we might have to tune it according to our app. 2) Also, for the script to calculate the scores, the result dump should be in the MLPerf client result format (if not, the script needs to be changed).
Attaching a sample output format of the client for reference.
Regarding the model scores for client, the submissions of all participants for 0.5 and 0.6 are listed here: results.
The 1.0 submission final date is this Friday (20/6), so the latest scores (for Llama 2, Llama 3.1-8B, and Phi-3.5) are not added to this link yet; they will be available after that date.
Results for meta-llama/Llama-3.1-8B-Instruct with 2K context len:
==== 5 shot =====
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
mmlu_llama | 1 | strict_match | | exact_match | ↑ | 0.6937 | ± | 0.0037
- humanities | 1 | strict_match | | exact_match | ↑ | 0.6574 | ± | 0.0067
- other | 1 | strict_match | | exact_match | ↑ | 0.7461 | ± | 0.0075
- social sciences | 1 | strict_match | | exact_match | ↑ | 0.7862 | ± | 0.0073
- stem | 0 | strict_match | | exact_match | ↑ | 0.6061 | ± | 0.0084
==== 1 shot =====
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
mmlu_llama | 1 | strict_match | | exact_match | ↑ | 0.6858 | ± | 0.0037
- humanities | 1 | strict_match | | exact_match | ↑ | 0.6540 | ± | 0.0067
- other | 1 | strict_match | | exact_match | ↑ | 0.7354 | ± | 0.0075
- social sciences | 1 | strict_match | | exact_match | ↑ | 0.7784 | ± | 0.0074
- stem | 0 | strict_match | | exact_match | ↑ | 0.5940 | ± | 0.0084
==== 0 shot =====
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr
---|---|---|---|---|---|---|---|---
mmlu_llama | 1 | strict_match | | exact_match | ↑ | 0.6892 | ± | 0.0037
- humanities | 1 | strict_match | | exact_match | ↑ | 0.6582 | ± | 0.0067
- other | 1 | strict_match | | exact_match | ↑ | 0.7435 | ± | 0.0075
- social sciences | 1 | strict_match | | exact_match | ↑ | 0.7748 | ± | 0.0074
- stem | 0 | strict_match | | exact_match | ↑ | 0.5985 | ± | 0.0084
Command:
lm_eval \
--model vllm \
--model_args pretrained=$1,dtype=auto,max_model_len=2048,max_gen_toks=10,tensor_parallel_size=1,enable_prefix_caching=True \
--tasks mmlu_llama \
--fewshot_as_multiturn \
--apply_chat_template \
--num_fewshot 5 \
--batch_size auto
Please check this PR https://github.com/EleutherAI/lm-evaluation-harness/pull/2797
lm-eval version: 0.4.8
> @freedomtan @Mostelk PFA the link to the MMLU benchmark script folder used in client development.
> Things to note:
> - The script is tuned to work with the MLPerf client application; we might have to tune it according to our app.
> - Also, for the script to calculate the scores, the result dump should be in the MLPerf client result format (if not, the script needs to be changed)
> Attaching a sample output format of the client for reference.
@Aswinoss please provide the link to the MMLU eval script from https://github.com/mlcommons/mlperf_client_dev
`--fewshot_as_multiturn` is to provide few-shot examples as a multi-turn conversation, that is, to chop the examples into small chunks. We need to discuss if that's what we want to do.
> @freedomtan @Mostelk PFA the link to the MMLU benchmark script folder used in client development: mmlu client script. Things to note:
> - The script is tuned to work with the MLPerf client application; we might have to tune it according to our app.
> - Also, for the script to calculate the scores, the result dump should be in the MLPerf client result format (if not, the script needs to be changed)
> Attaching a sample output format of the client for reference: results.json
>
> @Aswinoss please provide the link to the MMLU eval script from https://github.com/mlcommons/mlperf_client_dev
My bad. This is the external link for the same: MMLU_client_dev
@freedomtan to find the exact mmlu_llama parameters used by lm_eval.
Let's check exactly what `--fewshot_as_multiturn` does and whether we want to have something like it.
It's basically like, for something 5-shot, the larger input tensors are chopped into smaller chunks.
I read through the lm_eval code; it seems my understanding of `--fewshot_as_multiturn` was almost totally wrong: what does it do
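(For reference, per the lm-eval documentation: with `--apply_chat_template`, the few-shot examples are normally folded into the first user message; with `--fewshot_as_multiturn` they are instead laid out as alternating user/assistant turns. A rough illustration with made-up question/answer strings, not the actual task templates:)

```python
# Rough illustration of --apply_chat_template with and without
# --fewshot_as_multiturn; the question/answer strings are made up,
# the real prompts come from the task's templates.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

shots = [("Q1 ...", "A1"), ("Q2 ...", "A2")]
target_q = "Q3 ..."

# without --fewshot_as_multiturn: shots folded into a single user turn
single_turn = [{
    "role": "user",
    "content": "\n\n".join(f"{q}\n{a}" for q, a in shots) + "\n\n" + target_q,
}]

# with --fewshot_as_multiturn: each shot becomes a user/assistant pair
multi_turn = []
for q, a in shots:
    multi_turn += [{"role": "user", "content": q},
                   {"role": "assistant", "content": a}]
multi_turn.append({"role": "user", "content": target_q})

print(tok.apply_chat_template(multi_turn, tokenize=False, add_generation_prompt=True))
```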
mmlu_llama: it seems the most important difference between mmlu and mmlu_llama is how the prompts are articulated.
mmlu_llama on device, few-shots as multi-turn:
- lm_eval: that's fine. On-device: we need to implement them if we want to use them.
- for the current non-on-device evaluation, we can simply use lm_eval + mmlu_llama and fewshot-as-multiturn (5-shot).
- let's do 3.1-8B, 3.2-3B, and maybe 3.2-1B (the Instruct ones, because the base models need more instructions in the prompts). Let's try to report quant numbers for these.
Convert PyTorch model -> quantized TFLite model (a non-standard one); it's possible to run them on x86 machines, and we have a Python API -> .dla format
(https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py#L49)
Q: PyTorch model -> ONNX (floating point) -> quantized QNN binary (not runnable on x86). How about: lm_eval on Windows on ARM? (let's check if this works: @mohitmundhragithub and @Aswinoss)
We were investigating this method. In our client app, the Llama model runs using the GENIE SDK, which has the necessary inference pipeline built in to run Llama. The bins are created after quantization to be supported by this pipeline.
lm_eval (Python) will not be able to support GENIE, as GENIE provides only C++ APIs.
We shall discuss more on this in tomorrow's meeting.
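(One option, should Python bindings to such a runtime become available, is that lm-eval can be extended with a custom model backend. A rough sketch, assuming lm-eval 0.4.x's `LM` interface; `genie_generate` is a hypothetical bridge, not an existing API:)

```python
# Rough sketch of a custom lm-eval backend wrapping an external runtime.
# Assumes lm-eval 0.4.x's LM interface; `genie_generate` is a hypothetical
# Python bridge to the C++ runtime, not an existing API.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("genie")
class GenieLM(LM):
    def generate_until(self, requests):
        outputs = []
        for request in requests:
            context, gen_kwargs = request.args
            outputs.append(genie_generate(context, **gen_kwargs))  # hypothetical bridge
        return outputs

    def loglikelihood(self, requests):
        # multiple-choice tasks (e.g. MMLU) need per-continuation loglikelihoods;
        # the runtime would have to expose token logprobs for this to work
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # needed for perplexity tasks such as wikitext
        raise NotImplementedError
```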
mlperf_client: MMLU and perplexity numbers are close enough to what is reported by lm_eval (see the meta-llama/Llama-3.1-8B-Instruct results above).
The client working group used a threshold of 62 for the client MMLU test:
- Llama 2 7B: 43
- Llama 3.1 8B Instruct: 62
- Phi 3.5 Mini Instruct: 59
- Phi 4 Reasoning 14B (Exp): 70
Which "accuracy" metric(s) should we use for LLM benchmarking?