ProbablyFaiz opened 2 years ago
Right now, I'm thinking precision-recall curves. They're similar to AUROC but work better in unbounded result spaces (which ~1,000,000 candidates essentially is).
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
They offer a threshold-agnostic and true-positive-set-size-agnostic method of measuring recommender quality.
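For reference, here's a minimal sketch of computing a PR curve with scikit-learn (the labels and scores below are made up for illustration, not from our pipeline):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical data: binary relevance labels and model scores
# for a pool of candidate results.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

# One (precision, recall) point per score threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision summarizes the curve as a single number.
ap = average_precision_score(y_true, y_score)
print(f"AP = {ap:.3f}")
```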
Currently, we use the variant of recall described in Huang 2021:
This is alright, but it leaves a lot to be desired with respect to a fuller understanding of our models' performance and their ability to surface useful cases. We have some other ideas (to be documented later) about what kinds of metrics might serve us better.
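For concreteness, here's a sketch of plain recall@k, the general family this metric belongs to (whether this matches the Huang 2021 variant exactly is an assumption on my part, and the names are illustrative):

```python
def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of truly relevant items appearing in the top-k
    of the ranked results."""
    if not relevant:
        return 0.0
    top_k = set(ranked[:k])
    return len(relevant & top_k) / len(relevant)

# Toy example: 3 relevant items, 2 of them ranked in the top 5 -> 0.667.
print(recall_at_k({"a", "b", "c"}, ["a", "x", "b", "y", "z"], k=5))
```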