Open distbit0 opened 1 year ago

Hey, I find your benchmark very interesting. Perhaps several other top models, rather than only gpt4, could be used to grade responses, as a way to mitigate gpt4's biases.

These grades could be displayed separately, so that gpt4's unadulterated grades remain visible in case the other models' grades turn out to be less accurate. A total grade could also be shown, calculated as the average of the grades from the top n models. I'd be interested in seeing these figures: they might reveal otherwise invisible biases of gpt4, and would produce rankings less influenced by the style preferences of any single model. It could also make a rubric-free ranking more feasible, letting each model apply its own subjective criteria when judging responses (potentially rewarding creative or novel answers), without the biases of a single judge significantly distorting the result.

Interested in your thoughts. Thanks!

This is a good idea, agreed. Maybe with some sort of calculated weight for each model's grades, updating the score to reflect their results (but at the same time I don't want the scoring to get too complex, and some models are downright bad at grading, so there would be some discrimination among judges anyway).

Open to implementation ideas here!