neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Feature Branch][DeepSparse Evaluation API] Update lm-eval, perplexity, additional datasets #1580

Closed dbogunowicz closed 4 months ago

dbogunowicz commented 5 months ago

This PR updates lm-eval from version 0.3 to 0.4. Supported and tested datasets to evaluate on: gsm8k, hellaswag, arc_challenge.

Example usage

Example using the CLI (when lm-eval is not installed):

 deepsparse.eval hf:mgoin/TinyStories-1M-ds --dataset hellaswag --dataset arc_challange --limit 2

2024-02-05 13:27:51 deepsparse.evaluation.cli INFO     Creating deepsparse pipeline to evaluate from model path: hf:mgoin/TinyStories-1M-ds
2024-02-05 13:27:51 deepsparse.evaluation.cli INFO     Datasets to evaluate on: ['hellaswag', 'arc_challange']
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Additional integration arguments supplied: {'limit': 2}
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 164189.84it/s]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 54535.87it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 92459.61it/s]
2024-02-05 13:27:54 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:27:54 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
Traceback (most recent call last):
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/integrations/__init__.py", line 20, in try_import_lm_evaluation_harness
    import lm_eval
ModuleNotFoundError: No module named 'lm_eval'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/bin/deepsparse.eval", line 8, in <module>
    sys.exit(main())
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/cli.py", line 193, in main
    result: Result = evaluate(
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/evaluator.py", line 63, in evaluate
    eval_integration = EvaluationRegistry.resolve(pipeline, datasets, integration)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/registry.py", line 72, in resolve
    potentially_check_dependency_import(integration)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/utils.py", line 46, in potentially_check_dependency_import
    try_import_lm_evaluation_harness(raise_error=True)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/integrations/__init__.py", line 25, in try_import_lm_evaluation_harness
    raise ImportError(
ImportError: Unable to import lm_eval. To install run 'pip install lm-eval==0.4.0'
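
As the traceback shows, the missing dependency surfaces as an actionable error rather than a crash deep inside the harness. Installing the pinned version, exactly as the message suggests, resolves it:

 pip install lm-eval==0.4.0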

Example using the CLI (with lm-eval installed):

 deepsparse.eval hf:mgoin/TinyStories-1M-ds --dataset hellaswag --dataset arc_challange --limit 2

2024-02-05 13:24:42 deepsparse.evaluation.cli INFO     Creating deepsparse pipeline to evaluate from model path: hf:mgoin/TinyStories-1M-ds
2024-02-05 13:24:42 deepsparse.evaluation.cli INFO     Datasets to evaluate on: ['hellaswag', 'arc_challange']
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Additional integration arguments supplied: {'limit': 2}
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 39911.20it/s]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 20042.29it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 31906.88it/s]
2024-02-05 13:24:45 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:24:45 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
2024-02-05:13:24:49,100 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05:13:24:51,939 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05 13:24:51 deepsparse.evaluation.integrations.lm_evaluation_harness INFO     Selected Tasks: ['hellaswag']
2024-02-05:13:24:51,940 INFO     [lm_evaluation_harness.py:67] Selected Tasks: ['hellaswag']
2024-02-05:13:24:55,591 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:24:55,592 INFO     [evaluator.py:319] Running loglikelihood requests
100%|██████████| 8/8 [01:11<00:00,  8.98s/it]
2024-02-05 13:26:07 deepsparse.evaluation.cli INFO     Evaluation done. Results:
formatted=[Evaluation(task='lm_evaluation_harness', dataset=Dataset(type=None, name='hellaswag', config={'model': 'DeepSparseLM', 'model_args': None, 'batch_size': 1, 'batch_sizes': [], 'device': None, 'use_cache': None, 'limit': 2, 'bootstrap_iters': 100000, 'gen_kwargs': None}, split=None), metrics=[Metric(name='acc,none', value=0.0), Metric(name='acc_stderr,none', value=0.0), Metric(name='acc_norm,none', value=1.0), Metric(name='acc_norm_stderr,none', value=0.0)], samples=None)]
2024-02-05 13:26:07 deepsparse.evaluation.cli INFO     Saving the evaluation results to /nm/drive0/damian/deepsparse/result.json
2024-02-05:13:26:07,507 INFO     [cli.py:212] Saving the evaluation results to /nm/drive0/damian/deepsparse/result.json
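
The remaining supported dataset, gsm8k, follows the same CLI pattern; a sketch of the invocation (run not reproduced here):

 deepsparse.eval hf:mgoin/TinyStories-1M-ds --dataset gsm8k --limit 2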

Example using evaluate function:

from deepsparse import evaluate

out = evaluate(model="hf:mgoin/TinyStories-1M-ds",
               datasets=["hellaswag", "arc_challenge"],
               limit=2)
print(out)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 131820.98it/s]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 151767.58it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 35654.83it/s]
2024-02-05 13:09:34 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:09:34 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
2024-02-05:13:09:38,769 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05:13:09:41,599 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05 13:09:41 deepsparse.evaluation.integrations.lm_evaluation_harness INFO     Selected Tasks: ['arc_challenge', 'hellaswag']
2024-02-05:13:09:41,601 INFO     [lm_evaluation_harness.py:67] Selected Tasks: ['arc_challenge', 'hellaswag']
2024-02-05:13:09:48,822 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:09:48,829 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:09:48,832 INFO     [evaluator.py:319] Running loglikelihood requests
100%|██████████| 16/16 [05:34<00:00, 20.92s/it]
formatted=[Evaluation(task='lm_evaluation_harness', dataset=Dataset(type=None, name='arc_challenge', config={'model': 'DeepSparseLM', 'model_args': None, ...
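
Since the Result repr above exposes a formatted list of Evaluation objects, individual scores can be pulled out programmatically. A minimal post-processing sketch, assuming the attributes match the repr printed in the logs:

# iterate over the evaluations returned by `evaluate` above
for evaluation in out.formatted:
    print(evaluation.dataset.name)  # e.g. 'arc_challenge' or 'hellaswag'
    for metric in evaluation.metrics:
        # metric names follow lm-eval 0.4 conventions, e.g. 'acc,none'
        print(f"  {metric.name}: {metric.value}")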

Example running the unit tests (requires lm-eval==0.4 to be installed):

damian@gpuserver6:/nm/drive0/damian/deepsparse$ pytest tests/deepsparse/evaluation/integrations/test_lm_evaluation_harness.py 
=============================== test session starts ===============================
platform linux -- Python 3.10.12, pytest-7.4.3, pluggy-1.3.0
rootdir: /nm/drive0/damian/deepsparse
configfile: pyproject.toml
plugins: flaky-3.7.0, anyio-3.7.1
collected 8 items

tests/deepsparse/evaluation/integrations/test_lm_evaluation_harness.py ........ [100%]

==================== 8 passed, 19 warnings in 302.35s (0:05:02) ====================