
🧐 LLM AutoEval

🐦 Follow me on X • 🤗 Hugging Face • 💻 Blog • 📙 Hands-on GNN

Simplify LLM evaluation using a convenient Colab notebook.

Open In Colab


🔍 Overview

LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. You just need to specify the name of your model, a benchmark, a GPU, and press run!
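
As a rough illustration, the whole configuration comes down to a handful of values like the ones below (variable names, defaults, and options are illustrative, not necessarily the notebook's exact form fields):

```python
# Illustrative configuration only: the notebook exposes these as Colab form
# fields, and the exact names and options may differ.
MODEL_ID = "mlabonne/NeuralBeagle14-7B"   # any Hugging Face model id
BENCHMARK = "nous"                        # benchmark suite: Nous, Lighteval, or Open LLM
GPU = "NVIDIA GeForce RTX 3090"           # RunPod GPU type to rent
NUMBER_OF_GPUS = 1                        # how many GPUs the pod should have
```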

Key Features

View a sample summary here.

Note: This project is in the early stages and primarily designed for personal use. Use it carefully and feel free to contribute.

⚡ Quick Start

Evaluation

Cloud GPU

Tokens

Tokens are stored in Colab's Secrets tab. Create two secrets called "runpod" and "github" and add the corresponding tokens: your RunPod API key and a GitHub token (used to upload the summary as a gist).
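
As a minimal sketch of how these secrets are read back inside the notebook (assuming notebook access is enabled for both secrets), Colab's `userdata` API can be used:

```python
# Read the API tokens stored in Colab's Secrets tab. Requires secrets named
# "runpod" and "github" with notebook access enabled.
from google.colab import userdata

RUNPOD_TOKEN = userdata.get("runpod")   # RunPod API key, used to rent the cloud GPU
GITHUB_TOKEN = userdata.get("github")   # GitHub token, used to upload the summary as a gist
```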

📊 Benchmark suites

Nous

You can compare your results with:

Lighteval

You can compare your results on a case-by-case basis, depending on the tasks you have selected.

Open LLM

You can compare your results with those listed on the Open LLM Leaderboard.
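
These suites run on top of EleutherAI's lm-evaluation-harness (see Acknowledgements). As a rough sketch of an equivalent local run, assuming harness v0.4+ and illustrative task names and few-shot settings that may not match the notebook's exact configuration:

```python
# Sketch only: evaluate a model on Open-LLM-style tasks with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Task names, few-shot settings,
# and the model id are illustrative.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=mlabonne/NeuralBeagle14-7B,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])  # per-task metrics such as accuracy and stderr
```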

🏆 Leaderboard

I use the summaries produced by LLM AutoEval to create YALL - Yet Another LLM Leaderboard, which presents the results with plots.


Let me know if you're interested in creating your own leaderboard from your gists in one click. The code can easily be converted into a small notebook to build a similar space.
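
For a rough idea of what such a notebook would do, assuming the summaries are stored as public Markdown gists, it mostly amounts to pulling them from the GitHub REST API and parsing the scores:

```python
# Hypothetical sketch: list a user's public gists and download LLM AutoEval
# summaries to build a leaderboard. File-naming assumptions are illustrative.
import requests

USERNAME = "your-github-username"

gists = requests.get(f"https://api.github.com/users/{USERNAME}/gists").json()
for gist in gists:
    for filename, meta in gist["files"].items():
        if filename.endswith(".md"):                  # assume summaries are Markdown files
            summary = requests.get(meta["raw_url"]).text
            print(gist["description"])                # parse scores from `summary` here
```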

🛠️ Troubleshooting

Acknowledgements

Special thanks to burtenshaw for integrating lighteval, EleutherAI for the lm-evaluation-harness, dmahan93 for his fork that adds agieval to the lm-evaluation-harness, Hugging Face for the lighteval library, NousResearch and Teknium for the Nous benchmark suite, and vllm for the additional inference speed.