Open tballison opened 5 years ago
Would it be sufficient to output a confusion matrix at the experiment level? E.g., Experiment A's results have 90% overlap with Experiment B's, Experiment B's have 50% overlap with Experiment C's, etc.
Or do we also need per-query comparisons to allow for drill-down?
The first is straightforward, and we have a model for that already. The second, I worry, would produce far too much information... what exactly would it look like?
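For discussion, here's a minimal sketch of what the experiment-level matrix might compute, assuming each experiment's results are stored as a map from query to a ranked list of document IDs and using Jaccard overlap over the top-n results. The names (`overlap`, `overlap_matrix`, the `n=10` cutoff) are hypothetical, not anything in the codebase:

```python
from itertools import combinations

def overlap(docs_a, docs_b, n=10):
    """Jaccard overlap of the top-n results from two ranked lists."""
    a, b = set(docs_a[:n]), set(docs_b[:n])
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def overlap_matrix(experiments, n=10):
    """Mean per-query top-n overlap for every pair of experiments.

    `experiments` maps an experiment name to {query: [doc_id, ...]}.
    Returns {(exp_a, exp_b): mean overlap across shared queries}.
    """
    matrix = {}
    for (name_a, runs_a), (name_b, runs_b) in combinations(experiments.items(), 2):
        shared = runs_a.keys() & runs_b.keys()
        scores = [overlap(runs_a[q], runs_b[q], n) for q in shared]
        matrix[(name_a, name_b)] = sum(scores) / len(scores) if scores else 0.0
    return matrix
```

The per-query `scores` list is the raw material a drill-down view would expose; the matrix is the aggregate that would be reported at the experiment level. Note this requires no relevance judgments at all.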
The confusion matrix makes the most sense at a reporting level. The drill-down seems like a task for a different kind of tool?
As mentioned in #41, it may be helpful to compare result-set overlap whether or not judgments are available.