adamboazbecker opened 1 month ago
While reading model cards from different LLM providers, I noticed that these models were evaluated with different frameworks and methods, so it is hard to compare them apples-to-apples.
To compare the models fairly, I think we need to evaluate them all against the same quality framework. I would first define a standard quality framework designed around my use case, and then evaluate each model on it.
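To make the idea concrete, the two steps above (define one framework, then score every model against it) might look roughly like this minimal sketch. The rubric criteria, the stubbed model outputs, and all names here are hypothetical placeholders, not a real eval harness:

```python
from typing import Callable

# Step 1: define one "quality framework" as a set of named checks,
# chosen around the use case. These three criteria are made up for
# illustration only.
RUBRIC: dict[str, Callable[[str], bool]] = {
    "non_empty": lambda answer: len(answer.strip()) > 0,
    "cites_source": lambda answer: "source:" in answer.lower(),
    "under_100_words": lambda answer: len(answer.split()) <= 100,
}

def score(answer: str) -> float:
    """Fraction of rubric criteria the answer satisfies."""
    return sum(check(answer) for check in RUBRIC.values()) / len(RUBRIC)

# Step 2: run every model on the same prompt and apply the same rubric.
# Real model calls are stubbed out with fixed strings here.
answers = {
    "model_a": "Paris is the capital of France. Source: encyclopedia.",
    "model_b": "",
}

scores = {name: score(ans) for name, ans in answers.items()}
print(scores)
```

Because every model passes through the identical rubric, the resulting scores are directly comparable, which is exactly what differing provider-reported benchmarks don't give us.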
What do we do when different teams follow different quality frameworks?