huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

[BUG] MATH eval is calculated with non-greedy sampling #341

Closed edbeeching closed 1 month ago

edbeeching commented 1 month ago

Describe the bug

Two metrics are reported for MATH, Metrics.quasi_exact_match_math and Metrics.maj_at_4_math. The first should be computed with greedy generation, whereas the second uses sampling and majority voting. However, the two metrics fall under MetricCategory.GENERATIVE and MetricCategory.GENERATIVE_SAMPLING respectively, which means that when the requests are built for this task the two metrics are grouped into a single request in the construct_requests method:

        if (
            self.has_metric_category[MetricCategory.GENERATIVE_SAMPLING]
            or self.has_metric_category[MetricCategory.GENERATIVE]
            or self.has_metric_category[MetricCategory.GENERATIVE_LOGPROB]
        ):
            # All these tasks require the same generation process - we can do them in one step
            use_logits = self.has_metric_category[MetricCategory.GENERATIVE_LOGPROB]
            requests[RequestType.GREEDY_UNTIL] += [
                GreedyUntilRequest(
                    task_name=current_task_name,
                    sample_index=document_id_seed,
                    request_index=0,
                    context=context,
                    stop_sequence=self.stop_sequence,
                    generation_size=self.generation_size,
                    generation_grammar=self.generation_grammar,
                    num_samples=max(self.num_samples),  # If we have several samplings to apply, we use the max
                    use_logits=use_logits,
                    metric_categories=[
                        c
                        for c in [
                            MetricCategory.GENERATIVE_SAMPLING,
                            MetricCategory.GENERATIVE,
                            MetricCategory.GENERATIVE_LOGPROB,
                        ]
                        if self.has_metric_category[c]
                    ],
                )
            ]
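To make the grouping concrete, here is a minimal sketch (the per-metric sample counts are my own illustration, not values read from the lighteval source) of what the condition above implies for MATH: both metrics end up sharing one generation request whose num_samples is the maximum over the requested samplings.

```python
# Minimal sketch with assumed values: quasi_exact_match_math needs 1 sample
# (greedy), maj_at_4_math needs 4 samples (temperature sampling).
num_samples_per_metric = {"quasi_exact_match_math": 1, "maj_at_4_math": 4}

# construct_requests collapses both metrics into a single generation request,
# so only max(...) generations are run per document.
num_samples = max(num_samples_per_metric.values())
print(num_samples)  # 4 -- and all 4 are produced with sampling, not greedily
```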

To Reproduce

lighteval accelerate --model_args=pretrained=Qwen/Qwen2.5-Math-1.5B-instruct,dtype=float16,use_chat_template=True --tasks=lighteval|math_cot:algebra|0|0 --output_dir=./evals/math_cot

I added some breakpoints and looked at the requests, generations, etc.

Expected behavior

I would expect separate requests to be generated for these two metrics, rather than grouping GENERATIVE_SAMPLING and GENERATIVE together. It may be as trivial as splitting these into two if statements, but I am unsure of the implications for other tasks.
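A rough sketch of the kind of split I have in mind (hypothetical, simplified helper and field names, not a patch against construct_requests):

```python
# Hypothetical sketch: build two separate requests instead of one, so the
# greedy metric never sees temperature-sampled generations.
def build_requests(has_generative: bool, has_sampling: bool, max_samples: int) -> list[dict]:
    requests = []
    if has_generative:
        # greedy decoding, a single deterministic sample
        requests.append({"num_samples": 1, "do_sample": False})
    if has_sampling:
        # temperature sampling for majority voting
        requests.append({"num_samples": max_samples, "do_sample": True})
    return requests

print(build_requests(True, True, 4))
# [{'num_samples': 1, 'do_sample': False}, {'num_samples': 4, 'do_sample': True}]
```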

cc @lewtun

clefourrier commented 1 month ago

Hi! Grouping the requests when building them is not a bug, as it allows us to run inference only max_samples times instead of max_samples + 1 times. So for quasi_exact_match we keep only the first sample (which is the one we would normally have gotten with greedy sampling), and for maj at 4 we keep samples 1 to 4 for the metric computation.
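In other words, the intended sample reuse looks like this (a minimal sketch with assumed variable names, not the actual lighteval code):

```python
# Sketch of the intended reuse of one batch of generations for both metrics.
generations = ["ans_0", "ans_1", "ans_2", "ans_3"]  # num_samples = max(1, 4)

quasi_exact_match_input = generations[0]  # first sample, meant to stand in for greedy
maj_at_4_input = generations[:4]          # all 4 samples, used for majority voting
```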

Did you observe that the quasi_exact_match was computed on 4 samples? (If yes, then we have a bug)

edbeeching commented 1 month ago

Hi @clefourrier, the issue is that the first of the 4 samples is generated using sampling with a temperature, rather than greedily?

clefourrier commented 1 month ago

Ok, got you! Yep, I checked the code and ran a small experiment, and indeed sampling with 1 sample does not give the same results as greedy generation (which is obvious a posteriori) :sweat_smile: I'll fix this asap, thanks for the report!

edbeeching commented 1 month ago

Thanks! I think that with the vllm backend, sampling with t=1.0 is also used even when num_samples==1; in that case it should default to greedy to match the transformers backend (I believe). I am still confirming this, but I will open a separate issue for it.
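For reference, a minimal sketch of what I mean on the vllm side (the helper and the parameter choices are assumptions, not the current lighteval defaults): when only one sample is requested, the temperature could be forced to 0 so vllm decodes greedily and matches the transformers backend.

```python
from vllm import SamplingParams

def make_sampling_params(num_samples: int, temperature: float = 1.0) -> SamplingParams:
    # Assumed behaviour, not the current lighteval default: fall back to
    # greedy decoding (temperature=0) when a single sample is requested.
    if num_samples == 1:
        temperature = 0.0
    return SamplingParams(n=num_samples, temperature=temperature, max_tokens=1024)
```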

clefourrier commented 1 month ago

Thanks a lot!