Closed lewtun closed 4 months ago
Hi, Thanks for the issue!
I can confirm that the generation size is unbounded, which you can see in the task description:

```json
{
  "name": "math:algebra",
  "suite": ["lighteval", "math"],
  "prompt_function": "math",
  "hf_repo": "lighteval/MATH",
  "hf_subset": "algebra",
  "hf_avail_splits": ["train", "test", "validation"],
  "evaluation_splits": ["test"],
  "few_shots_split": null,
  "few_shots_select": null,
  "generation_size": null,
  "metric": ["quasi_exact_match_math"],
  "stop_sequence": ["\n"],
  "output_regex": null,
  "frozen": false
}
```
When `generation_size` is null, there is no bound except the model's max context length (which should be around 8K for both of these models, though).
I'll check if the paper defines a maximum expected generation size, else will fix the bound to the maximum answer size + 10% maybe?
> I'll check if the paper defines a maximum expected generation size, else will fix the bound to the maximum answer size + 10% maybe?
Yes, alternatively we could set the max generation size to something like 1024 or 2048 tokens, since if a model cannot answer within that span then it is likely incorrect. You can see here that the authors chose 1024 tokens for models other than gpt2-xl, so 2048 seems like a safe bet.
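If it helps, a minimal sketch of what that change might look like, assuming the same task-description format shown above (the 2048 value is the one proposed here, not something already in the repo):

```json
{
  "name": "math:algebra",
  "generation_size": 2048,
  "stop_sequence": ["\n"]
}
```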
Sounds perfect, will use this rn!
Is the MATH benchmark expected to run with anything beyond `batch_size=1`? Running the following command for a small model gives OOM on a single node of H100s, which is a bit surprising to me:
Strangely enough, bumping up the batch size for Mistral 7B is fine:
Perhaps there's some sort of unbounded generation occurring which is causing the memory to explode for certain models like Qwen?
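For intuition on why unbounded generation can blow up memory: the KV cache grows linearly with the generated sequence length, so letting a model run to its full context window multiplies the per-batch memory by the length ratio. A rough back-of-the-envelope sketch (not lighteval code; the 7B-style shape parameters below are illustrative assumptions, not Qwen's actual config):

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: 2 tensors (K and V) per layer, fp16 = 2 bytes."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 7B-style model: 32 layers, 32 KV heads, head_dim 128.
capped = kv_cache_bytes(batch_size=8, seq_len=2048, n_layers=32, n_kv_heads=32, head_dim=128)
uncapped = kv_cache_bytes(batch_size=8, seq_len=32768, n_layers=32, n_kv_heads=32, head_dim=128)

print(f"capped at 2048 tokens:  {capped / 2**30:.1f} GiB")   # 8.0 GiB
print(f"uncapped at 32K tokens: {uncapped / 2**30:.1f} GiB")  # 128.0 GiB
```

So a model that ignores the stop sequence and generates to a long context limit can plausibly OOM even on H100s, while a model that stops early stays well within budget.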