Applying the recent changes made by @tlrmchlsmth and @mgoin to the DeepSparse wrapper for lm-eval.
When this PR lands, we will be able to successfully test our pipelines with lm-eval on tasks that evaluate both perplexity and generation.
Testing
Successfully ran pytest tests/deepsparse/evaluation/, including the tests that require the lm-eval dependency.
[x] Also tested manually on a llama model (running with --limit 2), just to be sure that generation is working as expected.
2023-11-16 15:42:16 __main__ INFO Target to evaluate: hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant-ds
2023-11-16 15:42:16 __main__ INFO A pipeline with the engine type: deepsparse will be created
2023-11-16 15:42:16 __main__ INFO Datasets to evaluate on: gsm8k
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 89451.98it/s]
2023-11-16 15:42:17 deepsparse.transformers.pipelines.text_generation INFO Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231115 COMMUNITY | (8ac4188c) (release) (optimized) (system=avx2, binary=avx2)
[7fcdcad6b700 >WARN< operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
Task: gsm8k; number of docs: 1319
Task: gsm8k; document 0; context prompt (starting on next line):
Question: Jared is trying to increase his typing speed. He starts with 47 words per minute (WPM). After some lessons the next time he tests his typing speed it has increased to 52 WPM. If he continues to increase his typing speed once more by 5 words, what will be the average of the three measurements?
Answer:
(end of prompt on previous line)
Requests: Req_greedy_until('Question: Jared is trying to increase his typing speed. He starts with 47 words per minute (WPM). After some lessons the next time he tests his typing speed it has increased to 52 WPM. If he continues to increase his typing speed once more by 5 words, what will be the average of the three measurements?\nAnswer:', {'until': [':', 'Question:', 'Question']})[None]
Running greedy_until requests
0it [00:00, ?it/s]
2023-11-16 15:42:59 __main__ INFO Evaluation done. Results:
[
  {
    "task": "llm_evaluation_harness",
    "dataset": {
      "type": null,
      "name": "gsm8k",
      "config": {
        "model": null,
        "model_args": "",
        "num_fewshot": 0,
        "batch_size": 1,
        "batch_sizes": [],
        "device": null,
        "no_cache": false,
        "limit": 2,
        "bootstrap_iters": 100000,
        "description_dict": {}
      },
      "split": null
    },
    "metrics": [
      {
        "name": "acc",
        "value": 0.5
      },
      {
        "name": "acc_stderr",
        "value": 0.5
      }
    ],
    "samples": null
  }
]
2023-11-16 15:42:59 __main__ INFO Saving the evaluation results to /home/ubuntu/damian/deepsparse/src/deepsparse/evaluation/result.json
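The saved result.json follows the structure shown in the log above (a list of task entries, each with a dataset, a metrics list, and optional samples). As a minimal sketch of consuming those results downstream, here is one way to flatten them into a per-task metric dictionary; the helper name `summarize_metrics` is hypothetical, not part of the deepsparse API:

```python
def summarize_metrics(results):
    """Flatten an evaluation result list (as saved to result.json)
    into {task_name: {metric_name: value}}."""
    summary = {}
    for entry in results:
        summary[entry["task"]] = {
            m["name"]: m["value"] for m in entry["metrics"]
        }
    return summary


# Example using the structure printed in the log above:
results = [
    {
        "task": "llm_evaluation_harness",
        "dataset": {"type": None, "name": "gsm8k"},
        "metrics": [
            {"name": "acc", "value": 0.5},
            {"name": "acc_stderr", "value": 0.5},
        ],
        "samples": None,
    }
]

print(summarize_metrics(results))
# {'llm_evaluation_harness': {'acc': 0.5, 'acc_stderr': 0.5}}
```

In practice the list would be loaded with `json.load` from the result.json path logged above before being summarized.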