lm-sys / llm-decontaminator

Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
Apache License 2.0

One-time evaluation? Sounds like making fresh final exams for students when we were at school #1

Open yhyu13 opened 1 year ago

yhyu13 commented 1 year ago

We propose to build fresh one-time questions to evaluate LLMs instead of relying on static benchmarks.

This is one of the proposals in your paper. It might be easy for coding/math problems, since they can be generated from an almost infinite number of combinations.
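As a toy illustration (not from the paper or this repo), this is roughly what I mean by drawing fresh, automatically gradable math items from a seeded combination space:

```python
# Illustrative sketch: generate one-time arithmetic questions from random
# combinations, keeping the ground-truth answer so grading stays automatic.
# The question template and number ranges here are arbitrary choices.
import random


def make_arithmetic_question(seed: int) -> dict:
    """Build a fresh multi-step arithmetic word problem from a random seed."""
    rng = random.Random(seed)
    a, b, c = rng.randint(10, 99), rng.randint(2, 9), rng.randint(100, 999)
    question = (
        f"A warehouse holds {c} boxes. Each day it ships {a} boxes and "
        f"receives {b} new pallets of {a} boxes each. "
        f"How many boxes are in the warehouse after one day?"
    )
    answer = c - a + b * a
    return {"question": question, "answer": answer}


if __name__ == "__main__":
    # Each evaluation run draws previously unseen seeds, so the items
    # cannot already sit in any training set.
    for seed in random.sample(range(10**9), k=3):
        item = make_arithmetic_question(seed)
        print(item["question"], "->", item["answer"])
```

Coding and math admit this kind of procedural generation; the harder question is what the equivalent looks like for open-ended domains.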

Is there an active community building such one-time evaluations for other domains?

vishaal27 commented 1 year ago

This ICCV paper explores topics along similar lines: https://arxiv.org/abs/2212.02774

andy-yang-1 commented 1 year ago

Is there an active community building such one-time evaluations for other domains?

@yhyu13 @vishaal27 I believe Chatbot Arena is one of the best dynamic benchmarks: it reflects people's preferences through real votes. Besides, Dynabench is a good approach; it can also help prevent overfitting to benchmarks.
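For context, a minimal sketch of how arena-style pairwise votes can be aggregated into Elo-style ratings; the K factor, vote format, and model names below are assumptions for illustration, not Chatbot Arena's actual implementation:

```python
# Illustrative Elo aggregation over pairwise human votes.
from collections import defaultdict

K = 32  # update step size (assumed value for the example)


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_ratings(votes, initial=1000.0):
    """votes: iterable of (model_a, model_b, winner), winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in votes:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = 1.0 if winner == "a" else 0.0
        ratings[model_a] += K * (s_a - e_a)
        ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical votes; model names are placeholders.
    votes = [("model-x", "model-y", "a"), ("model-y", "model-z", "a")]
    print(update_ratings(votes))
```

Because the prompts come from live users, there is no fixed test set to leak or overfit to.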

yhyu13 commented 1 year ago

Is there an active community building such one-time evaluations for other domains?

@yhyu13 @vishaal27 I believe Chatbot Arena is one of the best dynamic benchmarks: it reflects people's preferences through real votes. Besides, Dynabench is a good approach; it can also help prevent overfitting to benchmarks.

The only question is: would these benchmarks be considered authoritative standards?