Precision & Recall - Githubissues

mlandry22 commented 9 years ago

New thread to post thoughts about precision and recall.

mlandry22 commented 9 years ago

Hey guys, I put out a new thread to try and separate any thoughts about precision and recall from the main modeling stream we have. You probably use email to respond, and find the latest, which will put it out of place. That's OK of course, we're not doing this professionally.

OK, so why precision and recall? Well, it happens to have come up at work today, and I've been looking over papers for hunting down the minority class.

What's interesting is that it isn't as straightforward as I thought. People talk about doing things like upsampling and downsampling as if it gets you a "better" result. And it really won't. It just might get you a better partial result. Namely, F-statistic type measures that balance precision and recall.

Here is my current thinking: using any resampling method for classification will have this effect:

similar AUC
likely worse probabilistic accuracy
- worse logLoss
- worse calibration: "5% predictions happen 5% of the time"
likely better balance of precision and recall

If this is the case, it means that these exotic algorithms like SMOTE aren't very useful except if you really care about optimizing recall. Yet that tends to not be the case when people talk about imbalanced data sets. They want to under/over-sample to get a 50/50 distribution. But if you're judging the probabilistic accuracy you likely won't be right, even if you scale probabilities (unless you know your data set well enough to know a good transformation). And the same for AUC.

Our problem likely doesn't give enough weight to precision to worry about this. That's a good reason to keep this thread out there. But it's interesting.

Paper we at H2O have been looking at to wonder how to do the rescaling properly: http://www.researchgate.net/publication/24395913_Balanced_gradient_boosting_from_imbalanced_data_for_clinical_outcome_prediction

And to me, the motivating factor behind this post is the outcome tables at the end. Notice they always have a drop in accuracy, which I am assuming because they are using a hard threshold, so when they mess with the probabilities, it judges too many as positive (rather than a tuned threshold). But their F-measure and geometric mean are often way, way better than the other methods. And AUC is comparable. It's better than the gradient boosting implementation they used, but not by enough that I think that's the interesting part. Especially since they often lose to RF in AUC.

CarbonCycles commented 9 years ago

Nice thoughts Mark...reminds me of some of the stuff we had to worry about when we dealt with sampling signals...looks like there may be an analog.

mlandry22 / icdm-2015

Precision & Recall #3