Applying the recent changes made by @tlrmchlsmth and @mgoin to the DeepSparse wrapper for lm-eval.
When this PR lands, we will be able to successfully test our pipelines with lm-eval on tasks that evaluate both perplexity and generation.
Testing
Successfully ran pytest tests/deepsparse/evaluation/, including the tests that require the lm-eval dependency.
[x] Also tested manually on a llama model (running with --limit 2), just to be sure that generation is working as expected.
2023-11-16 15:42:16 __main__ INFO Target to evaluate: hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant-ds
2023-11-16 15:42:16 __main__ INFO A pipeline with the engine type: deepsparse will be created
2023-11-16 15:42:16 __main__ INFO Datasets to evaluate on: gsm8k
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 89451.98it/s]
2023-11-16 15:42:17 deepsparse.transformers.pipelines.text_generation INFO Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231115 COMMUNITY | (8ac4188c) (release) (optimized) (system=avx2, binary=avx2)
[7fcdcad6b700 >WARN< operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
Task: gsm8k; number of docs: 1319
Task: gsm8k; document 0; context prompt (starting on next line):
Question: Jared is trying to increase his typing speed. He starts with 47 words per minute (WPM). After some lessons the next time he tests his typing speed it has increased to 52 WPM. If he continues to increase his typing speed once more by 5 words, what will be the average of the three measurements?
Answer:
(end of prompt on previous line)
Requests: Req_greedy_until('Question: Jared is trying to increase his typing speed. He starts with 47 words per minute (WPM). After some lessons the next time he tests his typing speed it has increased to 52 WPM. If he continues to increase his typing speed once more by 5 words, what will be the average of the three measurements?\nAnswer:', {'until': [':', 'Question:', 'Question']})[None]
Running greedy_until requests
0it [00:00, ?it/s]
2023-11-16 15:42:59 __main__ INFO Evaluation done. Results:
[
  {
    "task": "llm_evaluation_harness",
    "dataset": {
      "type": null,
      "name": "gsm8k",
      "config": {
        "model": null,
        "model_args": "",
        "num_fewshot": 0,
        "batch_size": 1,
        "batch_sizes": [],
        "device": null,
        "no_cache": false,
        "limit": 2,
        "bootstrap_iters": 100000,
        "description_dict": {}
      },
      "split": null
    },
    "metrics": [
      {
        "name": "acc",
        "value": 0.5
      },
      {
        "name": "acc_stderr",
        "value": 0.5
      }
    ],
    "samples": null
  }
]
2023-11-16 15:42:59 __main__ INFO Saving the evaluation results to /home/ubuntu/damian/deepsparse/src/deepsparse/evaluation/result.json
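The saved result.json follows the structure shown in the log above (a list of task entries, each with a dataset, a metrics list, and optional samples). As a minimal sketch of consuming those results downstream, here is one way to flatten them into a per-task metric dictionary; the helper name `summarize_metrics` is hypothetical, not part of the deepsparse API:

```python
def summarize_metrics(results):
    """Flatten an evaluation result list (as saved to result.json)
    into {task_name: {metric_name: value}}."""
    summary = {}
    for entry in results:
        summary[entry["task"]] = {
            m["name"]: m["value"] for m in entry["metrics"]
        }
    return summary


# Example using the structure printed in the log above:
results = [
    {
        "task": "llm_evaluation_harness",
        "dataset": {"type": None, "name": "gsm8k"},
        "metrics": [
            {"name": "acc", "value": 0.5},
            {"name": "acc_stderr", "value": 0.5},
        ],
        "samples": None,
    }
]

print(summarize_metrics(results))
# {'llm_evaluation_harness': {'acc': 0.5, 'acc_stderr': 0.5}}
```

In practice the list would be loaded with `json.load` from the result.json path logged above before being summarized.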