Suggestion for LLM Evaluation Tutorials with Evalverse

jihoo-kim opened 5 months ago
Hey, thanks for the suggestion, this is quite exciting. I've been looking for something like this for a while.
I tried it yesterday and ran into some issues (quoted in the reply below).
In general, I would really appreciate it if we could have an example with Llama 3.
@mlabonne Thanks for accepting my suggestion and for trying it out.
> Results couldn't be written to disk (at least for MT-Bench and EQ-Bench), which means I lost my MT-Bench results.
Could you tell me which script you ran? If you specify the `output_path` argument, the results will be saved to disk. The default value of `output_path` is the directory where evalverse is located. Please try again with your own `output_path`.
CLI:

```bash
python3 evaluator.py \
  --ckpt_path {your_model} \
  --mt_bench \
  --num_gpus_total 8 \
  --parallel_api 4 \
  --output_path {your_path}
```
Library:

```python
import evalverse as ev

evaluator = ev.Evaluator()
evaluator.run(
    model={your_model},
    benchmark="mt_bench",
    num_gpus_total=8,
    parallel_api=4,
    output_path={your_path}
)
```
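As a quick sanity check that results actually landed on disk, you can list whatever the run wrote under the path you passed. A minimal sketch using only the Python standard library (the `./results` path is just an illustrative placeholder, not Evalverse's default):

```python
from pathlib import Path

# Assumed placeholder: the same directory passed as output_path above.
output_path = Path("./results")

# Recursively list every file the evaluation run wrote, if any.
for f in sorted(output_path.rglob("*")):
    if f.is_file():
        print(f.relative_to(output_path))
```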
> I couldn't use EQ-Bench without a default chat template (ChatML seems to be selected by default), which meant I couldn't use Llama 3's chat template.
I will fix it as soon as possible and let you know. Thank you!
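For reference on the chat template point: Llama 3's template ships with its tokenizer on the Hugging Face Hub, so a harness can pick it up via transformers' `apply_chat_template` instead of defaulting to ChatML. A minimal sketch, assuming access to the gated `meta-llama/Meta-Llama-3-8B-Instruct` checkpoint (the messages are illustrative, and this is not Evalverse's internal API):

```python
from transformers import AutoTokenizer

# Gated model: requires accepting the license and a Hugging Face access token.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about evaluation."},
]

# Renders the conversation with Llama 3's own template
# (<|start_header_id|> ... <|eot_id|> markers) rather than ChatML.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```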