Open yhyu13 opened 1 year ago
This ICCV paper explores topics on similar lines: https://arxiv.org/abs/2212.02774
@yhyu13 @vishaal27 I believe Chatbot Arena is one of the best dynamic benchmarks. It reflects people's preferences through real votes. Dynabench is another good approach, and it can also help prevent overfitting to benchmarks.
The only question is whether these benchmarks would be considered authoritative standards.
This is one of the proposals in your paper. It might be easy for coding/math problems, as they can be generated from an almost infinite number of combinations.
Is there an active community building such one-time evaluations for other domains?
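To make the "almost infinite combinations" point concrete, here is a minimal sketch of how a one-time math evaluation set could be generated. It is only an illustration under my own assumptions, not the method from the paper: the arithmetic template and the names `make_problem` and `make_eval_set` are hypothetical.

```python
import operator
import random

# Hypothetical operator table for a toy arithmetic template.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Return one (question, answer) pair of the form (a op b) op c."""
    a, b, c = (rng.randint(2, 999) for _ in range(3))
    s1, s2 = rng.choice(list(OPS)), rng.choice(list(OPS))
    answer = OPS[s2](OPS[s1](a, b), c)
    return f"What is ({a} {s1} {b}) {s2} {c}?", answer


def make_eval_set(seed: int, n: int) -> list[tuple[str, int]]:
    """Build n problems from a seed; a new seed gives a fresh set."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in make_eval_set(seed=2024, n=3):
        print(question, "->", answer)
```

Keeping the seed private until the evaluation is run, then publishing it afterwards, would let others reproduce the exact set while making it hard to train on in advance. Domains without a cheap answer checker do not get this for free, which is the gap the question above is about.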