Hi!
Grouping the requests when building them is not a bug, as it allows us to run inference only max_samples times, not max_samples + 1 times.
So for quasi_exact_match, we keep only the first sample (which is the one we would have gotten with greedy sampling normally), and for maj@4, we keep samples 1 to 4 for the metric computations.
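For illustration, here is a minimal sketch of the slicing described above; the names (run_inference, the prompt) are hypothetical, not the library's actual internals:

```python
# Hypothetical sketch: one batch of generations is shared across metrics
# instead of running inference separately per metric.

def run_inference(prompt: str, max_samples: int) -> list[str]:
    """Placeholder for the model call: returns max_samples generations."""
    return [f"generation_{i}" for i in range(max_samples)]

max_samples = 4
samples = run_inference("Solve: 1 + 1 = ?", max_samples)

# quasi_exact_match uses only the first sample (intended to stand in
# for the greedy generation)...
quasi_exact_match_input = samples[0]

# ...while maj@4 votes over samples 1 to 4, i.e. all four generations.
maj_at_4_inputs = samples[:4]
```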
Did you observe that quasi_exact_match was computed on 4 samples? (If yes, then we have a bug.)
Hi @clefourrier, the first sample of the 4 is generated using sampling with a temperature, rather than greedily?
Ok, got you! Yep, I checked the code and ran a small experiment, and indeed sampling with 1 sample does not give the same results as greedy generation (which is obvious a posteriori) :sweat_smile: I'll fix this asap, thanks for the report!
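For reference, a small experiment along these lines can be run with the standard transformers generate API (the model choice here is just an example):

```python
# Drawing one sample (do_sample=True) is not equivalent to greedy decoding
# (do_sample=False): sampling draws from the token distribution at each step,
# while greedy always picks the argmax token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The answer to 2 + 2 is", return_tensors="pt")

torch.manual_seed(0)
sampled = model.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=10)
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=10)

# The two decodes will generally differ.
print(tokenizer.decode(sampled[0]))
print(tokenizer.decode(greedy[0]))
```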
Thanks! I think the vLLM backend also uses sampling with t=1.0, even when num_samples == 1; in that case it should default to greedy to match the transformers backend (I believe). I am just confirming, but I will open another issue for that.
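For context, a sketch of the vLLM side using the standard vLLM API (the model choice is illustrative): temperature=1.0 samples even when n == 1, while temperature=0.0 is how vLLM expresses greedy decoding, which would match the transformers backend.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="gpt2")  # example model

# Behaviour described above: a single sample, but still drawn with t=1.0.
sampled_params = SamplingParams(temperature=1.0, n=1, max_tokens=10)

# Proposed default when only one sample is requested: greedy decoding.
greedy_params = SamplingParams(temperature=0.0, n=1, max_tokens=10)

outputs = llm.generate(["The answer to 2 + 2 is"], greedy_params)
print(outputs[0].outputs[0].text)
```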
Thanks a lot!
Describe the bug
Two metrics are reported for MATH, Metrics.quasi_exact_match_math and Metrics.maj_at_4_math. The first metric should be calculated using greedy generation, whereas the second should use sampling and majority voting. However, both metrics are in the categories MetricCategory.GENERATIVE or MetricCategory.GENERATIVE_SAMPLING, which means that when the requests are built for this task, the two metrics are grouped into the same category in the construct_requests method.

To Reproduce
I added some breakpoints and looked at the requests, generations, etc.
Expected behavior
I would expect separate requests to be generated for these two metrics, rather than grouping GENERATIVE_SAMPLING and GENERATIVE; it may be as trivial as separating these into two if statements (a sketch of the split is included below). I am unsure of the implications for other tasks.

cc @lewtun
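To make the suggestion concrete, here is a minimal, self-contained sketch of the proposed split into two if statements; the names and structure (GenerationRequest, the construct_requests signature) are illustrative, not the library's exact code:

```python
from dataclasses import dataclass
from enum import Enum, auto


class MetricCategory(Enum):
    GENERATIVE = auto()
    GENERATIVE_SAMPLING = auto()


@dataclass
class GenerationRequest:
    do_sample: bool
    num_samples: int


def construct_requests(categories: set[MetricCategory], max_samples: int):
    """Build one request per category instead of one shared request."""
    requests = []
    # Greedy generation for metrics like quasi_exact_match.
    if MetricCategory.GENERATIVE in categories:
        requests.append(GenerationRequest(do_sample=False, num_samples=1))
    # Temperature sampling for metrics like maj@4.
    if MetricCategory.GENERATIVE_SAMPLING in categories:
        requests.append(GenerationRequest(do_sample=True, num_samples=max_samples))
    return requests


print(construct_requests(
    {MetricCategory.GENERATIVE, MetricCategory.GENERATIVE_SAMPLING}, max_samples=4
))
```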