lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0
656 stars 74 forks

Allow setting generation sampling parameters #9

Closed psinger closed 5 months ago

psinger commented 7 months ago

Currently, the only generation parameter that can be set is temperature: https://github.com/lm-sys/arena-hard/blob/main/config/gen_answer_config.yaml#L5

However, it would be useful to also be able to set other parameters, such as repetition_penalty, ideally on a per-model level. These could then be passed to the API endpoints for generation accordingly.
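For illustration, here is a minimal sketch of how per-model sampling parameters could be layered on top of the existing gen_answer_config.yaml and forwarded to the generation call. The sampling_params section and the build_generation_kwargs helper are hypothetical, not part of the current config schema:

```python
# Hypothetical sketch only: the "sampling_params" key and this helper are not
# part of arena-hard today; they show one way per-model overrides could work.
import yaml  # PyYAML


def build_generation_kwargs(config_path: str, model_name: str) -> dict:
    """Merge global defaults with hypothetical per-model overrides."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    # Global defaults from the existing config (temperature is the generation
    # parameter it currently exposes).
    kwargs = {"temperature": config.get("temperature", 0.0)}
    # Hypothetical per-model overrides, e.g. repetition_penalty for small models.
    kwargs.update(config.get("sampling_params", {}).get(model_name, {}))
    return kwargs


# Example usage (the per-model section would be an addition to the YAML file):
# print(build_generation_kwargs("config/gen_answer_config.yaml", "my-small-model"))
```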

CodingWithTim commented 7 months ago

Thanks for the feedback! What is the purpose of repetition_penalty? Could you explain a bit more?

psinger commented 7 months ago

It is not only about repetition_penalty; that was just an example. You might also want to benchmark with various other generation parameters.

But regarding repetition_penalty specifically: some small models have issues with repetitive output, and setting this to something like 1.1 can help a lot. Several popular models have it as a default setting in their generation_config on HF, for example, but it won't be picked up here in the benchmarks.
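For context, a rough sketch of what passing repetition_penalty looks like when generating with Hugging Face transformers; the model id here is only a placeholder:

```python
# Sketch, assuming a small chat model prone to repetitive output; the model id
# is a placeholder, not a specific recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/small-chat-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a short poem about the sea.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.1,  # discourages the repetitive loops small models can fall into
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```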

infwinston commented 7 months ago

@psinger Agreed, we should make this more customizable! Also, curious about your feedback on this new benchmark. Do you find it useful?

psinger commented 7 months ago

In general yes, but I think it is currently too biased towards coding.

CodingWithTim commented 7 months ago

@psinger Thanks for the feedback! Would you be able to submit a PR to implement the additional customization? We would love to have it implemented in Arena Hard. Would much appreciate it!

chujiezheng commented 7 months ago

I agree that the decoding hyperparameters are critical to generation quality. However, I am a bit concerned about the fairness of the comparison. For instance, you might find A > B on the leaderboard, but A uses a set of carefully tuned decoding hyperparameters while B uses the default greedy decoding...

chujiezheng commented 7 months ago

Of course, you can rerun the evaluation using the same hyperparameters. But it would cost more and make the existing leaderboard less useful...

psinger commented 7 months ago

Certain generation parameters are part of a model, though. In MT-Bench, the generation config is also honored.

Here is an example of such a default setting: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat/blob/main/generation_config.json#L9

Actually, I think setting repetition_penalty to at least 1.05 for all models is a fairer comparison than keeping it at 1.0.

Most inference interfaces also set this higher than 1.0 by default.
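As a quick illustration, the defaults a model ships in generation_config.json can be read back with transformers' GenerationConfig, using the Qwen checkpoint linked above:

```python
# Sketch: reading the decoding defaults the model authors shipped, so the
# benchmark could generate with the settings they intended.
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("Qwen/Qwen1.5-1.8B-Chat")
# The value the model authors set in the linked generation_config.json.
print(gen_config.repetition_penalty)
print(gen_config.top_p, gen_config.temperature)
```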

chujiezheng commented 7 months ago

Yes, I agree. However, I found that many models do not specify decoding hyperparameters in generation_config.json. How should we handle these cases?

BTW, I agree with the repetition_penalty part as it generally helps.
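One possible (hypothetical) way to handle such cases: read the raw generation_config.json from the Hub and fall back to a shared default such as 1.05 when the key is absent. The resolve_repetition_penalty helper below is illustrative, not existing arena-hard code:

```python
# Sketch only: fall back to a benchmark-wide default when a model omits (or
# does not ship) a generation_config.json value. The 1.05 fallback mirrors the
# suggestion above and is just an example.
import json

from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError


def resolve_repetition_penalty(model_id: str, fallback: float = 1.05) -> float:
    try:
        path = hf_hub_download(repo_id=model_id, filename="generation_config.json")
    except EntryNotFoundError:
        return fallback  # model ships no generation_config.json at all
    with open(path) as f:
        cfg = json.load(f)
    # Use the model authors' value if present, otherwise the shared default.
    return cfg.get("repetition_penalty", fallback)
```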

psinger commented 7 months ago

It is a tricky topic, but I think it is the model creators' job to set good defaults there, since they know best what works for their models.

A regular user will run the models exactly as they are specified in the HF configs.

bittersweet1999 commented 6 months ago

I agree with setting generation sampling parameters, because the evaluation of subjective dialogue experience should align with the model's settings during real conversations. By the way, we have added support for evaluating the ArenaHard dataset in OpenCompass, where you can specify greedy decoding or sampling parameters. You can also specify an accelerator (like vLLM or LMDeploy) to speed up model inference. More information here: https://github.com/lm-sys/arena-hard/issues/13
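For reference, a hedged sketch of specifying greedy decoding versus sampling parameters with vLLM's Python API (the OpenCompass config syntax itself is not shown here); the checkpoint is the example mentioned earlier in the thread:

```python
# Sketch: greedy decoding vs. explicit sampling parameters with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-1.8B-Chat")  # example checkpoint from the thread

greedy = SamplingParams(temperature=0.0, max_tokens=256)
sampled = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.1, max_tokens=256)

outputs = llm.generate(["Explain what a repetition penalty does."], sampled)
print(outputs[0].outputs[0].text)
```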