I'm not sure whether this feature belongs in this library or would need a completely separate one. I'm proposing a library for running LLM benchmarks, for example evaluating a model on HumanEval. Such a library would make evaluating LLMs much easier. I liked the dockerized approach used at https://github.com/NVlabs/verilog-eval to evaluate generated code safely. Evaluating the math and reasoning skills of LLMs could also be beneficial.
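
To make the idea more concrete, here is a rough sketch (not a proposed API, just an illustration) of how dockerized evaluation of a single HumanEval-style task could look. The names `EvalTask` and `run_in_docker` are hypothetical, and it assumes a local Docker installation plus the `python:3.11-slim` image:

```python
# Hypothetical sketch: run an untrusted model completion against its tests
# inside an isolated Docker container, similar in spirit to verilog-eval.
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalTask:
    """One benchmark problem: a prompt, a model completion, and a test harness."""
    prompt: str
    completion: str
    test_code: str


def run_in_docker(task: EvalTask, timeout_s: int = 30) -> bool:
    """Execute prompt + completion + tests in a container; True if tests pass."""
    program = "\n".join([task.prompt, task.completion, task.test_code])
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program)
        try:
            result = subprocess.run(
                [
                    "docker", "run", "--rm",
                    "--network", "none",      # no network access for untrusted code
                    "-v", f"{tmp}:/work:ro",  # mount the script read-only
                    "python:3.11-slim",
                    "python", "/work/candidate.py",
                ],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs/infinite loops as failures
        return result.returncode == 0
```

A library could then just map something like this over a benchmark dataset and report pass@k, while handling model prompting and sandboxing behind the scenes.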