This PR introduces the concept of Evaluation Comparison. Users can select any number of evaluations to compare and receive a report of model performance. The report contains the following sections:
Summary Metrics: Radar and Bar Plots summarizing the user-defined scoring functions as well as the model latency and tokens used
Scorecard: Allows users to compare the models, with an emphasis on properties that differ, and presents a visual summary of quantitative performance with differences highlighted.
Example Filter (only when comparing 2 evaluations): Allows users to interactively filter results based on comparative performance. This makes it fast and easy to find examples that are most challenging, or where model performance is most different between evaluations.
Output Comparison: A full-screen comparison experience allowing users to compare model outputs on the same inputs, see scores, latency, tokens, and more. Evaluations using multiple trials are grouped together for easy browsing.
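As a rough illustration of the Example Filter's comparative ranking idea, here is a minimal sketch (the function name `rank_by_score_diff` and the `{example_id: score}` input shape are hypothetical, not the actual implementation): it surfaces the examples where two evaluations' scores diverge most.

```python
def rank_by_score_diff(scores_a: dict, scores_b: dict) -> list:
    """Given {example_id: score} mappings for two evaluations, return the
    example ids they have in common, sorted so that the examples with the
    largest absolute score difference (most divergent performance) come first.
    """
    common = scores_a.keys() & scores_b.keys()
    return sorted(common, key=lambda ex: abs(scores_a[ex] - scores_b[ex]), reverse=True)


# Hypothetical usage: example "e2" diverges most, "e3" not at all.
eval_a = {"e1": 0.90, "e2": 0.50, "e3": 0.70}
eval_b = {"e1": 0.85, "e2": 0.10, "e3": 0.70}
print(rank_by_score_diff(eval_a, eval_b))  # ['e2', 'e1', 'e3']
```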
Quick Followups:
- [ ] Shareability: Retain filter / row selection in URL. Probably want to let the feature bake a bit before committing to a data model in the URL.
- [ ] Performance Audit: Audit the queries to identify areas for improvement and parallelization
- [ ] Feature: Add LLM Cost as a derived metric
- [ ] UX: Hover tooltips on all plots
- [ ] UX: Use user-defined call names for evaluation calls
- [ ] UX: Binary scores should be "confusion matrix" style
- [ ] UX: Consider making all filter plots 1-1 aspect ratio
- [ ] UX: Scatter plots should have dimension units
- [ ] UX: Expand top-level refs (e.g. when a model prompt is a ref, see /wandb-designers/signal-maven)
- [ ] Technical: The entire page uses a context `state` that all components consume - we might want to refactor this
LOOM: https://www.loom.com/share/e7f21b9d5fad427d9cec36a45c8745ed
Good Examples: