Closed: JoelNiklaus closed this issue 1 minute ago
SampleLevelMetrics should not be using only batch size one, and apply_generative_metrics supports batches (which are automatically inferred, as we need to group generation parameters together). Batch size, in general, is either automatically inferred or forced through a param (--batch_size, iirc).
What did you observe that on? Can you provide a command to repro?
Here it looks to me like there is only one example passed to the metric's compute function at a time.
Are you talking about the param --override_batch_size?
I only see the batch size applied to the model calls, but not the metric calls.
When I want to run an expensive metric like XCOMET-XXL, I need batching there.
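XCOMET itself scores a whole list of samples in one predict call, so the per-sample loop on the harness side is the real bottleneck. A rough sketch with the unbabel-comet package (checkpoint name and batch size are just illustrative):

from comet import download_model, load_from_checkpoint

# Load the metric model once; XCOMET-style checkpoints are downloaded from the HF hub.
model_path = download_model("Unbabel/XCOMET-XL")  # illustrative checkpoint
model = load_from_checkpoint(model_path)

data = [
    {"src": "Der Vertrag ist nichtig.", "mt": "The contract is void.", "ref": "The contract is null and void."},
    {"src": "Das Gericht weist die Klage ab.", "mt": "The court dismisses the claim.", "ref": "The court dismisses the action."},
]

# One call scores the whole list; batch_size controls the internal GPU batching.
output = model.predict(data, batch_size=8)
print(output.scores)  # one score per sample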
I observed it when evaluating the swiss_legal_evals:
python -m lighteval accelerate \
--model_args openai,model=o1-mini-2024-09-12 \
--tasks "community|slt-paragraph_level:de-fr|0|0" \
--custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
--output_dir outputs \
--override_batch_size 8 \
--save_details
Hi! You are right, as of now we only allow evaluating generative tasks with batch size 1. However, we had the same issue as you did for LLM-as-a-judge metrics (very expensive to run 1 by 1), so we made another metric type (here) to be able to pass all answers to the eval function in one go.
A solution for you would be to do exactly that. Something like generative_metric_parallel.
Other than that, we would need to change every sample_level_metric to take as argument a list of samples instead of passing the samples 1 by 1.
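To be concrete, the contract would simply be: one call per task, lists in, one score per sample out. A toy sketch (the function name and signature are illustrative, not our actual API):

from typing import List

def length_ratio_batched(golds: List[str], predictions: List[str]) -> List[float]:
    # One call per task: every gold/prediction pair arrives together, so an
    # expensive metric could batch its model calls internally instead.
    return [min(len(p), len(g)) / max(len(p), len(g), 1) for g, p in zip(golds, predictions)]

scores = length_ratio_batched(
    golds=["The contract is void.", "The court dismisses the claim."],
    predictions=["The contract is null.", "The court rejects the claim."],
)
assert len(scores) == 2  # one score per sample, computed in a single call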
I see. Yes, adjusting apply_generative_metric would imply changing each sample_level_metric to take a list of samples instead of one sample. Ok, I will open a PR with a new function generative_metric_parallel together with a new MetricCategory.
Wdyt about this solution @clefourrier?
Sorry, read too fast - mixed up inference/metric cost.
Yep, I think you could define a BatchedSampleMetric then, and use it in this case!
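Something along these lines, roughly (the field names are a sketch, not the final API):

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BatchedSampleMetric:
    # compute receives all samples of a task at once and returns one value per sample
    metric_name: str
    higher_is_better: bool
    compute: Callable[[List[str], List[str]], List[float]]

    def apply(self, golds: List[str], predictions: List[str]) -> List[float]:
        scores = self.compute(golds, predictions)
        assert len(scores) == len(predictions), "expected one score per sample"
        return scores

# Toy usage with a dummy batched scorer.
metric = BatchedSampleMetric(
    metric_name="toy_exact_match",
    higher_is_better=True,
    compute=lambda golds, preds: [float(g == p) for g, p in zip(golds, preds)],
)
print(metric.apply(["a", "b"], ["a", "c"]))  # [1.0, 0.0]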
Issue encountered
SampleLevelMetrics are always computed with batch size 1. This is really bad for more computationally expensive metrics involving LLM inference. Without batching these, it will take ages to evaluate. CorpusLevelMetrics are also not really a solution, because we want the metric on the sample level for statistics and for selecting samples for human evaluation afterwards.
Solution/Feature
In metrics.utils.__init__.py, apply_generative_metric needs to support batches. We can keep the default at 1, but we should expose a metric_batch_size argument at the top level of the evaluation.
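Roughly, the batching loop could look like this (a sketch only; the signature does not match the current apply_generative_metric):

from typing import Callable, List

def apply_metric_batched(
    golds: List[str],
    predictions: List[str],
    metric_fn: Callable[[List[str], List[str]], List[float]],
    metric_batch_size: int = 1,  # default 1 keeps the current behaviour
) -> List[float]:
    # Chunk all samples of the task into groups of metric_batch_size and hand
    # each chunk to the metric in one call, instead of one sample at a time.
    scores: List[float] = []
    for start in range(0, len(predictions), metric_batch_size):
        end = start + metric_batch_size
        scores.extend(metric_fn(golds[start:end], predictions[start:end]))
    return scores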
Possible alternatives
Currently, I don't see an alternative.