geoalgo opened this issue 2 months ago
Hi! Is your accelerate configuration the same in both cases? In any case, I'll take a look. Thanks for opening this issue!
Yes, I used the default one, thanks.
When you say the default one, what do you mean precisely? (Could you share the output of your configuration?) When launching lighteval you explicitly select the number of processes to use, but not when launching lm_eval.
If the model is small enough to fit twice on a GPU, you could be running DP8 with lm_eval but only DP4 with lighteval, which would also explain the difference in speed.
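To rule that out, it's worth pinning the process count explicitly on the harness side too. Assuming the big-refactor branch exposes the `lm_eval` module entry point, something like this (a sketch; flags may need adjusting for your setup):

```bash
# Pin the harness to the same 4-way data parallelism as the lighteval run;
# without --num_processes, accelerate may default to all visible GPUs.
accelerate launch --multi_gpu --num_processes=4 -m lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8
```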
I was using DDP with 4 GPUs in both cases (if I got everything right 😅).
This is the accelerate config output I got with lighteval:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
and the one I got with lm_eval:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `4`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
One thing I am now wondering is whether lighteval uses bf16 by default, which could explain a large part of the gap. I will rerun with the dtype set explicitly and let you know.
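As a quick sanity check on the dtype side, the dtype declared in the checkpoint config is what loaders typically fall back to when none is passed explicitly. Assuming `transformers` is installed (and you have access to the gated repo), something like:

```bash
# Print the torch_dtype declared in the model's config.json
# (loaders commonly default to this when no dtype is passed explicitly)
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('meta-llama/Meta-Llama-3-8B').torch_dtype)"
```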
Thanks a lot!
I reran setting bf16 explicitly, and it took 11 min with the following command:
time accelerate launch --multi_gpu --num_processes=4 lighteval/run_evals_accelerate.py --model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16" --tasks "leaderboard|arc:challenge|25|0" --output_dir "arc_challenge2" --override_batch_size 8
It still took longer than lm_eval, so bf16 does not seem to be the culprit.
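One other knob that could explain the remaining gap is batch size: if the harness run used `--batch_size auto`, it may have picked a much larger per-device batch than lighteval's fixed `--override_batch_size 8`, which would skew the timing. A hypothetical apples-to-apples rerun, pinning dtype, DP degree, and batch size on both sides (the harness flags are a best guess for the big-refactor branch):

```bash
# lighteval: bf16, DP4, fixed batch size 8
time accelerate launch --multi_gpu --num_processes=4 lighteval/run_evals_accelerate.py \
    --model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16" \
    --tasks "leaderboard|arc:challenge|25|0" \
    --output_dir "arc_challenge_bf16" \
    --override_batch_size 8

# lm_eval (big-refactor): same model, dtype, DP degree, and fixed batch size
time accelerate launch --multi_gpu --num_processes=4 -m lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8
```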
Hi,
Thanks for sharing this package, it has lots of cool features!
I saw that arc-challenge was taking about twice as long as what I get with the harness. I ran the following commands with lighteval:
and the following command with the harness (using the big-refactor branch):
Of course, many things could cause this, but I wanted to know whether you have faced something similar, or benchmarked lighteval against the harness.
If not, would you have a suggestion for getting similar performance? (It seems bf16 is used by default, so dtype should not be the culprit.)