lunary-ai / llm-benchmarks

LLM benchmarks
https://llm-benchmarks.vercel.app

Publish grades generated by other top models, in addition to gpt4, to control for its biases #1

Open · distbit0 opened this issue 1 year ago

distbit0 commented 1 year ago

Hey, I find your benchmark very interesting. Perhaps several other top models, rather than only gpt4, could be used to grade responses, as a measure to potentially mitigate gpt4's biases.

These grades could be displayed separately, so that gpt4's unadulterated grades remain visible in case the grades from other models turn out to be less accurate. A total grade could also be shown, calculated by averaging the grades generated by the top n models. I'd be interested in seeing these figures, as they may reveal gpt4's otherwise invisible biases and provide rankings less influenced by the style preferences of any single model. This may also make a rubric-free ranking more feasible, letting the models apply their own subjective criteria when judging responses (potentially encouraging creative/novel responses), without the biases of one model significantly distorting the result.
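To make the idea concrete, here's a rough sketch of what the aggregation could look like, assuming each judge model produces a numeric grade per response. Nothing here is from the repo; the function name, model ids, and grade values are all illustrative:

```typescript
// Hypothetical shape: one numeric grade per judge model for a single response.
type JudgeGrades = Record<string, number>;

// Average the grades from the selected judge models, keeping per-judge grades
// intact so gpt4's unadulterated score can still be displayed on its own.
function aggregateGrade(grades: JudgeGrades, topJudges: string[]): number {
  const used = topJudges.filter((j) => j in grades);
  if (used.length === 0) throw new Error("no grades from the selected judges");
  const sum = used.reduce((acc, j) => acc + grades[j], 0);
  return sum / used.length;
}

// Example: per-judge grades stay visible, plus one averaged total.
const grades: JudgeGrades = { "gpt-4": 8, "claude-2": 7.5, "gemini-pro": 7 };
console.log(aggregateGrade(grades, ["gpt-4", "claude-2", "gemini-pro"])); // 7.5
```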

Interested in your thoughts. thx

vincelwt commented 1 year ago

This is a good idea, agreed. Maybe each judging model's grades could get a calculated weight, with the overall score updated to reflect them (though I don't want the scoring to get too complex, and some models are downright bad at grading, so we'd still end up excluding or down-weighting some of them anyway).
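A minimal sketch of the weighted variant, assuming per-judge weights are maintained somewhere; judgeWeights, weightedGrade, and the weight values below are placeholders, and how to calibrate the weights is the open question:

```typescript
// Hypothetical per-judge weights (e.g. tuned by how reliable each judge's
// grading has proven to be); a weight of 0 effectively excludes a model
// that turns out to be bad at grading.
const judgeWeights: Record<string, number> = { "gpt-4": 1.0, "claude-2": 0.8 };

// Weighted average of the grades, skipping judges with no assigned weight.
function weightedGrade(grades: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const [judge, grade] of Object.entries(grades)) {
    const w = judgeWeights[judge] ?? 0;
    total += w * grade;
    weightSum += w;
  }
  if (weightSum === 0) throw new Error("no weighted judge graded this response");
  return total / weightSum;
}
```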

Open to implementation ideas here!