huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

[FT] Support batch metric computation for SampleLevelMetrics #404

Closed: JoelNiklaus closed this 1 minute ago

JoelNiklaus commented 23 hours ago

Issue encountered

SampleLevelMetrics are always computed with batch size 1. This is very slow for computationally expensive metrics that involve LLM inference; without batching, evaluation takes far too long. CorpusLevelMetrics are not a real alternative either, because we need the metric at the sample level for statistics and for selecting samples for human evaluation afterwards.

Solution/Feature

In metrics.utils.__init__.py, apply_generative_metric needs to support batches. We can still default the batch size to 1, but we should expose a metric_batch_size argument at the top level of the evaluation.
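
For illustration, a minimal sketch of what the batched application step could look like, assuming a hypothetical metric_batch_size argument and a compute_batch method on the metric (these names are illustrative, not the current lighteval API):

# Hypothetical sketch: chunked computation of a sample-level metric.
# `metric_batch_size` and `sample_metric.compute_batch` are assumptions
# for illustration, not existing lighteval names.
def apply_generative_metric_batched(samples, sample_metric, metric_batch_size=1):
    """Compute a sample-level metric in chunks instead of one sample at a time."""
    scores = []
    for start in range(0, len(samples), metric_batch_size):
        batch = samples[start : start + metric_batch_size]
        # One call per chunk lets expensive metrics (e.g. COMET- or LLM-based)
        # batch their underlying model inference.
        scores.extend(sample_metric.compute_batch(batch))
    # Still returns one score per sample, so sample-level statistics and
    # selection for human evaluation keep working unchanged.
    return scores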

Possible alternatives

Currently, I don't see an alternative.

clefourrier commented 23 hours ago

SampleLevelMetrics should not be using only batch size one, and apply_generative_metric supports batches (which are automatically inferred, as we need to group generation parameters together). Batch size, in general, is either automatically inferred or forced through a parameter (--batch_size, iirc).

What did you observe that on? Can you provide a command to repro?

JoelNiklaus commented 23 hours ago

Here it looks to me like there is only one example passed to the metric's compute function at a time.

Are you talking about the param --override_batch_size?

I only see the batch size applied to the model calls, but not the metric calls.

When I want to run an expensive metric like XCOMET-XXL, I need batching there.

I observed it when evaluating the swiss_legal_evals tasks:

python -m lighteval accelerate \
--model_args openai,model=o1-mini-2024-09-12 \
--tasks "community|slt-paragraph_level:de-fr|0|0" \
--custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
--output_dir outputs \
--override_batch_size 8 \
--save_details

NathanHB commented 23 hours ago

Hi! You are right, as of now we only allow evaluating generative tasks with batch size 1. However, we had the same issue for LLM-as-a-judge metrics (very expensive to run one by one), so we made another metric type (here) that can pass all answers to the eval function in one go.

A solution for you would be to do exactly that: something like generative_metric_parallel.

Other than that, we would need to change every sample_level_metric to take a list of samples as argument instead of passing the samples one by one.
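
A rough sketch of the first option, a metric type whose compute receives every sample in one call and returns one score per sample (the class name and signature are assumptions, not existing lighteval code):

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BatchedSampleLevelMetric:
    """Assumed metric type: compute() gets every sample in one call."""
    metric_name: str
    # Scores all predictions against all references at once,
    # returning one score per sample.
    compute_all: Callable[[List[str], List[str]], List[float]]

    def compute(self, predictions: List[str], references: List[str]) -> List[float]:
        # The metric decides internally how to batch its model calls.
        return self.compute_all(predictions, references)

# Usage sketch with a trivial scorer standing in for an expensive one.
def length_ratio(preds: List[str], refs: List[str]) -> List[float]:
    return [len(p) / max(len(r), 1) for p, r in zip(preds, refs)]

metric = BatchedSampleLevelMetric("length_ratio", length_ratio)
print(metric.compute(["short answer"], ["reference answer"]))  # one score per sample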

JoelNiklaus commented 22 hours ago

I see. Yes, adjusting apply_generative_metric would imply changing each sample_level_metric to take a list of samples instead of one sample. Ok, I will open a PR with a new function generative_metric_parallel together with a new MetricCategory.
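
Roughly the surface such a PR could add, with placeholder names for the new category and apply function (the final PR may look different):

from enum import Enum, auto

class MetricCategory(Enum):
    # Placeholder subset of the real enum; only the second value is new.
    GENERATIVE = auto()
    GENERATIVE_PARALLEL = auto()  # sample-level metrics that want all samples at once

def apply_generative_metric_parallel(metric, predictions, references):
    """Hand every sample to the metric in one call; expect one score per sample."""
    scores = metric.compute(predictions=predictions, references=references)
    assert len(scores) == len(predictions), "must stay a sample-level metric"
    return scores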

NathanHB commented 22 hours ago

wdyt about this solution @clefourrier?

clefourrier commented 22 hours ago

Sorry, I read too fast and mixed up inference cost with metric cost. Yep, I think you could define a BatchedSampleMetric then and use it in this case!