Open adamboazbecker opened 6 months ago

How do we know that the metrics we use for training are reflective of real-world user needs?
I think this depends on what the needs and metrics are. In general, it might be helpful to think of concrete scenarios, or classes of scenarios, that we might encounter in the real world, label them as good or bad based on human intuition, and then check how they score under our metric. We would expect the metric's values to match those human labels. In many cases, we can also use separate metrics to capture different aspects of human intuition.
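A minimal sketch of what that check could look like, with entirely hypothetical scenarios, human labels, metric scores, and threshold:

```python
# Take concrete scenarios, attach the good/bad label a human would give them,
# attach the score our metric produced, and check how often the two agree.
# All names and numbers below are illustrative, not real evaluation data.

# (scenario, human_says_good, metric_score)
scenarios = [
    ("correct, sourced answer to a billing question",    True,  0.91),
    ("politely declines an out-of-scope legal request",  True,  0.78),
    ("invents a refund policy that does not exist",      False, 0.34),
    ("repeats the user's question without answering it", False, 0.74),
]

THRESHOLD = 0.7  # assumed cut-off above which the metric calls a scenario "good"

agreements = [
    (score >= THRESHOLD) == human_good
    for _, human_good, score in scenarios
]
agreement_rate = sum(agreements) / len(agreements)

print(f"metric agrees with human intuition on {agreement_rate:.0%} of scenarios")
for (name, human_good, score), ok in zip(scenarios, agreements):
    flag = "OK" if ok else "MISMATCH"
    print(f"  [{flag}] human={'good' if human_good else 'bad'} metric={score:.2f}  {name}")
```

Mismatched scenarios are exactly the cases worth inspecting: either the metric needs fixing or the human label needs a second look.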
Agreed on the importance of breaking user needs down into categories of use cases and scenarios. To expand on making metrics match human intuition, we found the following steps essential (a rough sketch of measuring (1) and (2) follows below):

1. If any component of the benchmark is powered by synthetic data generation or automated evaluation (e.g. LLM-as-a-judge), check that its verdicts agree with human judgments.
2. Check that human evaluators agree with each other; diversity in the evaluator pool matters, but some use cases require domain expertise for high-quality evaluation.
3. Continually adapt the metrics as user needs change.
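One simple way to quantify (1) and (2) is chance-corrected agreement such as Cohen's kappa; the verdicts below are hypothetical pass/fail judgments on the same ten items:

```python
# Measure how well a second human and an LLM judge agree with a reference human
# annotator. If the judge's kappa is well below the human-human kappa, the
# LLM-as-a-judge component needs prompting or calibration work before its
# scores can be trusted.

from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Hypothetical verdicts on ten shared evaluation items.
human_a   = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human_b   = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "fail", "pass", "pass"]
llm_judge = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]

print(f"human A vs human B (inter-annotator): kappa = {cohen_kappa(human_a, human_b):.2f}")
print(f"human A vs LLM judge:                 kappa = {cohen_kappa(human_a, llm_judge):.2f}")
```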