huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, together with the recently released LLM data processing library datatrove and the LLM training library nanotron.
MIT License

Large memory usage on MATH #80

Closed · lewtun closed this issue 4 months ago

lewtun commented 4 months ago

Is the MATH benchmark expected to run for anything beyond batch_size=1?

Running the following command for a small model gives an OOM on a single node of H100s, which is a bit surprising to me:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|math:algebra|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B" \
    --override_batch_size 2

Strangely enough, bumping up the batch size for Mistral 7B is fine:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|math:algebra|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=mistralai/Mistral-7B-v0.1" \
    --override_batch_size 2

Perhaps there's some sort of unbounded generation occurring which is causing the memory to explode for certain models like Qwen?
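For a rough sense of scale, here is a back-of-the-envelope estimate of how KV-cache memory alone grows with the allowed generation length (the layer/head counts below are illustrative assumptions, not either model's actual config):

    # Back-of-the-envelope KV-cache size: 2 tensors (K and V) per layer,
    # each of shape [batch, kv_heads, seq_len, head_dim], stored in fp16/bf16.
    # Layer/head/dim values are illustrative assumptions, not real model configs.
    def kv_cache_gib(seq_len, batch, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
        return 2 * layers * batch * kv_heads * head_dim * seq_len * dtype_bytes / 2**30

    # A 2048-token cap vs. letting generation run out to a 32k context window
    for seq_len in (2_048, 32_768):
        print(f"{seq_len:>6} tokens -> {kv_cache_gib(seq_len, batch=2):.1f} GiB per process")

Whatever the exact numbers, memory scales linearly with the sequence length each batch element is allowed to reach, so an unbounded generation limit can dominate the footprint even for a small model.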

clefourrier commented 4 months ago

Hi, thanks for the issue!

I can confirm that the generation size is unbounded, which you can see in the task description:

{"name":"math:algebra","suite":["lighteval","math"],"prompt_function":"math","hf_repo":"lighteval\/MATH","hf_subset":"algebra","hf_avail_splits":["train","test","validation"],"evaluation_splits":["test"],"few_shots_split":null,"few_shots_select":null,"generation_size":null,"metric":["quasi_exact_match_math"],"stop_sequence":["\n"],"output_regex":null,"frozen":false}

When generation_size is null, there is no bound except the model's max context length (which should be around 8K for both these models, though).

I'll check whether the paper defines a maximum expected generation size; otherwise I'll set the bound to the maximum answer size + 10% or so.
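For illustration, a minimal sketch of how such a bound could be estimated from the dataset (this is not lighteval's implementation, and the "solution" field name is an assumption about the lighteval/MATH schema):

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B")
    ds = load_dataset("lighteval/MATH", "algebra", split="test")

    # Longest gold answer in tokens, plus ~10% headroom as proposed above
    max_answer_tokens = max(len(tokenizer(ex["solution"]).input_ids) for ex in ds)
    generation_size = int(max_answer_tokens * 1.1)
    print(max_answer_tokens, generation_size)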

lewtun commented 4 months ago

> I'll check whether the paper defines a maximum expected generation size; otherwise I'll set the bound to the maximum answer size + 10% or so.

Yes, alternatively we could set the max gen size to something like 1024 or 2048 tokens, since if a model cannot answer within that span it is likely incorrect. You can see here that the authors chose 1024 tokens for models other than gpt2-xl, so 2048 seems like a safe bet.
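Concretely, with such a cap the task description above would carry an explicit bound instead of null; an abridged sketch (2048 follows the suggestion above, the exact value ultimately committed is not recorded in this thread):

    {
      "name": "math:algebra",
      "generation_size": 2048,
      "stop_sequence": ["\n"]
    }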

clefourrier commented 4 months ago

Sounds perfect, will use this rn!