lunary-ai / llm-benchmarks

LLM benchmarks
https://llm-benchmarks.vercel.app

Publish grades generated by other top models, in addition to gpt4, to control for its biases #1

Open · distbit0 opened this issue 1 year ago

distbit0 commented 1 year ago

Hey, I find your benchmark very interesting. Perhaps several other top models, rather than only gpt4, could be used to grade responses, as a measure to potentially mitigate gpt4's biases.

These grades could be displayed separately, so that gpt4's unadulterated grades remain visible in case the grades from other models turn out to be less accurate. A total grade could also be shown, calculated by averaging the grades generated by the top n models. I'd be interested in seeing these figures, as they may reveal gpt4's otherwise invisible biases and provide rankings less influenced by the style preferences of any single model. This may also make a rubric-free ranking more feasible, letting the models apply their own subjective criteria when judging responses (potentially encouraging creative/novel responses), without the biases of one model significantly distorting the result.
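To make the idea concrete, here's a rough sketch of what the aggregation could look like, assuming each judge model produces a numeric grade per response. Nothing here is from the repo; the function name, model ids, and grade values are all illustrative:

```typescript
// Hypothetical shape: one numeric grade per judge model for a single response.
type JudgeGrades = Record<string, number>;

// Average the grades from the selected judge models, keeping per-judge grades
// intact so gpt4's unadulterated score can still be displayed on its own.
function aggregateGrade(grades: JudgeGrades, topJudges: string[]): number {
  const used = topJudges.filter((j) => j in grades);
  if (used.length === 0) throw new Error("no grades from the selected judges");
  const sum = used.reduce((acc, j) => acc + grades[j], 0);
  return sum / used.length;
}

// Example: per-judge grades stay visible, plus one averaged total.
const grades: JudgeGrades = { "gpt-4": 8, "claude-2": 7.5, "gemini-pro": 7 };
console.log(aggregateGrade(grades, ["gpt-4", "claude-2", "gemini-pro"])); // 7.5
```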

Interested in your thoughts. thx

vincelwt commented 1 year ago

This is a good idea, agreed. Maybe each judging model's grades could get a calculated weight, with the overall score updated to reflect them (though I don't want the scoring to get too complex, and some models are downright bad at grading, so we'd still end up excluding or down-weighting some of them anyway).
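A minimal sketch of the weighted variant, assuming per-judge weights are maintained somewhere; judgeWeights, weightedGrade, and the weight values below are placeholders, and how to calibrate the weights is the open question:

```typescript
// Hypothetical per-judge weights (e.g. tuned by how reliable each judge's
// grading has proven to be); a weight of 0 effectively excludes a model
// that turns out to be bad at grading.
const judgeWeights: Record<string, number> = { "gpt-4": 1.0, "claude-2": 0.8 };

// Weighted average of the grades, skipping judges with no assigned weight.
function weightedGrade(grades: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const [judge, grade] of Object.entries(grades)) {
    const w = judgeWeights[judge] ?? 0;
    total += w * grade;
    weightSum += w;
  }
  if (weightSum === 0) throw new Error("no weighted judge graded this response");
  return total / weightSum;
}
```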

Open to implementation ideas here!