huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

Performance compared to lm-evaluation-harness #179

Open geoalgo opened 2 months ago

geoalgo commented 2 months ago

Hi,

Thanks for sharing this package, it has lots of cool features!

I saw that arc-challenge was taking about twice as long as it does with the harness. I ran the following command with lighteval:

# ran with 4x A100 GPUs -> 611s
time accelerate launch --multi_gpu --num_processes=4 lighteval/run_evals_accelerate.py \
    --model_args="pretrained=meta-llama/Meta-Llama-3-8B" \
    --tasks "leaderboard|arc:challenge|25|0" \
    --output_dir "arc_challenge" \
    --override_batch_size 8

and the following command from harness (using big-refactor branch):

# ran with 4x A100 GPUs on the big-refactor branch -> 319s
time accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="bfloat16" \
    --tasks arc_challenge \
    --batch_size 8 \
    --num_fewshot=25

Of course, many things could cause this, but I wanted to know whether you have faced something similar or benchmarked lighteval against the harness.

If not, would you have a suggestion for getting similar performance? (It seems bf16 is used by default, so that should not be the culprit.)
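As a side note, a quick way to sanity-check what dtype the checkpoint itself declares (just a sketch; this only shows the config's advertised torch_dtype, not what either tool actually loads the weights in):

# prints the torch_dtype declared in the model's config.json
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('meta-llama/Meta-Llama-3-8B').torch_dtype)"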

clefourrier commented 2 months ago

Hi! Is your accelerate configuration the same in both cases? Otherwise I'll take a look, thanks for opening this issue!

geoalgo commented 2 months ago

Yes, I used the default one, thanks.

clefourrier commented 2 months ago

When you say the default one, what do you mean precisely? (Could you share the output of your configuration?) When launching lighteval you explicitly select the number of processes, but you don't for the lm_eval run.

If the model is small enough to fit twice on a GPU, you could be doing DP8 with lm_eval and only DP4 with lighteval, which would also explain the difference in speed.
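To rule that out, a sketch of forcing the lm_eval run onto the same four processes, reusing the accelerate launch flags from the lighteval command above:

# pin lm_eval to DP4 as well, matching the lighteval launch
time accelerate launch --multi_gpu --num_processes=4 -m lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="bfloat16" \
    --tasks arc_challenge \
    --batch_size 8 \
    --num_fewshot=25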

geoalgo commented 2 months ago

I was using DDP with 4 GPUs in both cases (if I got everything right 😅).

This is the accelerate config output I got with lighteval:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`

and the one I got with lm_eval:

    `--num_processes` was set to a value of `4`
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`

One thing I am now wondering is whether lighteval uses bf16 by default, since a dtype mismatch could cause a large gap. I will rerun with it set explicitly and let you know.
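For comparing the two setups more directly, accelerate can also print its environment and the default config it falls back to (a sketch; a simple way to confirm both runs see the same machine configuration):

# prints the accelerate version, environment info and the default config in use
accelerate env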

clefourrier commented 2 months ago

Thanks a lot!

geoalgo commented 2 months ago

I reran with bf16 set explicitly, and it took 11 minutes with the following command:

time accelerate launch --multi_gpu --num_processes=4 lighteval/run_evals_accelerate.py \
    --model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16" \
    --tasks "leaderboard|arc:challenge|25|0" \
    --output_dir "arc_challenge2" \
    --override_batch_size 8

and it still took longer than lm_eval, so bf16 does not seem to be the culprit.