Thanks for the feedback! What is the purpose of `repetition_penalty`? Could you explain a bit more?
It is not only `repetition_penalty`; that is just one example, and you might also want to benchmark across various other generation parameters. But regarding `repetition_penalty` specifically: some small models have issues with repetitive output, and setting it to something like 1.1 can help a lot. Several popular models ship it as a default setting in their `generation_config` on HF, for example, but it won't be picked up here in the benchmarks.
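For illustration, here is a minimal sketch (assuming the `transformers` library is available) of how to inspect the generation defaults a model ships with, which the benchmark currently ignores:

```python
# Minimal sketch: read the generation defaults a model ships on the HF Hub.
# The model ID is the Qwen example linked later in this thread; the printed
# value depends on the hosted generation_config.json.
from transformers import GenerationConfig

cfg = GenerationConfig.from_pretrained("Qwen/Qwen1.5-1.8B-Chat")
print(cfg.repetition_penalty)  # the model's shipped default, not the benchmark's
```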
@psinger Agreed, we should make this more customizable! Also, curious about your feedback on this new benchmark: do you find it useful?
In general yes, but I think it is currently too biased towards coding.
@psinger Thanks for the feedback! Would you be able to submit a PR to implement the additional customization? We would love to have it in Arena Hard. Would much appreciate it!
I agree that the decoding hyperparameters are critical to generation quality. However, I am a bit concerned about the fairness of comparison. For instance, you might find A > B on the leaderboard, but A uses a set of carefully tuned decoding hyperparameters while B uses default greedy decoding...
Of course, you could rerun the evaluation using the same hyperparameters, but that would cost more and make the existing leaderboard less useful...
Certain generation parameters are part of a model, though. In MT-Bench the generation config is also honored.
Here is an example of such a default setting: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat/blob/main/generation_config.json#L9
Actually, I think setting `repetition_penalty` to at least 1.05 for all models would be a fairer comparison than keeping it at 1.0. Most inference interfaces also set it higher than that by default.
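As a sketch of how such a parameter could be forwarded, assuming an OpenAI-compatible endpoint (e.g. a vLLM server) that accepts `repetition_penalty` as an extra request field; the base URL and model name below are placeholders:

```python
# Sketch: forward repetition_penalty through an OpenAI-compatible endpoint.
# Assumes a server (e.g. vLLM) that honors repetition_penalty in the request
# body; base_url and model name are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.0,
    extra_body={"repetition_penalty": 1.05},  # not part of the OpenAI schema itself
)
print(response.choices[0].message.content)
```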
Yes, I agree. However, I found that many models do not specify the decoding hyperparameters in `generation_config.json`. How should we handle these cases?
BTW, I agree with the `repetition_penalty` part, as it generally helps.
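One possible way to handle the missing-config case, as a hypothetical sketch (the helper name and default values are illustrative, not Arena Hard's actual behavior):

```python
# Hypothetical fallback: use the model's shipped generation_config when it
# exists, otherwise fall back to benchmark-wide defaults.
from transformers import GenerationConfig

BENCHMARK_DEFAULTS = {"temperature": 0.0, "repetition_penalty": 1.0}

def resolve_gen_params(model_id: str) -> dict:
    try:
        cfg = GenerationConfig.from_pretrained(model_id)
    except OSError:  # model ships no generation_config.json
        return dict(BENCHMARK_DEFAULTS)
    return {key: getattr(cfg, key, default) for key, default in BENCHMARK_DEFAULTS.items()}
```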
It is a tricky topic, but I think it is the model creators' job to set good defaults there, since they know best what works for their models.
The regular user will run the models exactly as they are specified in the HF configs.
I agree with setting generation sampling parameters, because the evaluation of subjective dialogue experience should align with the model's settings during real conversations. By the way, we have added support for evaluating the ArenaHard dataset in OpenCompass, where you can specify greedy decoding or sampling parameters. You can also specify an accelerator (like vLLM or LMDeploy) to speed up model inference. See here for more information: https://github.com/lm-sys/arena-hard/issues/13
Currently, the only generation parameter that can be set is `temperature`: https://github.com/lm-sys/arena-hard/blob/main/config/gen_answer_config.yaml#L5
However, it would be useful to also be able to set other parameters, such as `repetition_penalty`, ideally on a per-model level. These could then be passed accordingly to the API endpoints for generation.
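As a rough sketch of what a per-model extension of `gen_answer_config.yaml` might look like (the `model_parameters` key and the values below are hypothetical, not part of the current schema):

```yaml
# Hypothetical extension of gen_answer_config.yaml; only `temperature`
# exists in the current config, the rest is illustrative.
temperature: 0.0
model_parameters:
  qwen1.5-1.8b-chat:
    repetition_penalty: 1.1
  some-other-model:
    repetition_penalty: 1.05
```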