stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Silent (potential) bug in calibration_regression function. #121

Closed jirvin16 closed 4 years ago

jirvin16 commented 4 years ago

https://github.com/stanfordmlgroup/ngboost/blob/master/ngboost/evaluation.py#L16

If Y has a 1-D shape, i.e. (N,), then since icdfs has shape (N, 1), Y < icdfs is broadcast incorrectly and ends up with shape (N, N).
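A minimal standalone reproduction of the broadcasting behaviour (plain NumPy, not the actual evaluation code):

```python
import numpy as np

# Y with shape (N,) broadcasts against icdfs with shape (N, 1), so the
# comparison becomes an all-pairs (N, N) array instead of an elementwise result.
N = 5
Y = np.zeros(N)            # shape (5,)
icdfs = np.zeros((N, 1))   # shape (5, 1)
print((Y < icdfs).shape)   # prints (5, 5) instead of the intended (5,)
```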

Credits to @ezelikman for the catch!

tonyduan commented 4 years ago

Thanks for the catch, Jeremy and Eric!

Back when I wrote this evaluation code, the repo only supported Y of shape (N, 1). At some point we added support for shape (N,) and I forgot to update the evaluation accordingly.

In any case, this should now be fixed in [this commit], and both shapes (N, 1) and (N,) should be properly supported.
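For reference, a minimal sketch of the kind of shape normalization that makes both cases work (illustrative only; the actual commit may handle this differently):

```python
import numpy as np

# Illustrative helper: flatten both arrays so Y of shape (N,) or (N, 1)
# is compared elementwise with icdfs of shape (N,) or (N, 1).
def fraction_below(Y, icdfs):
    Y = np.asarray(Y).reshape(-1)          # (N,)
    icdfs = np.asarray(icdfs).reshape(-1)  # (N,)
    return np.mean(Y < icdfs)              # scalar: observed coverage at this quantile
```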

astrogilda commented 4 years ago

Hey, quick question about this: shouldn't calibration be done on a set other than the test set? That is, shouldn't one calibrate on the validation set, and then use the calibration parameters derived there to fix predictions on the test set?

tonyduan commented 4 years ago

@astrogilda There appears to be overloading of the term "calibration". Maybe this will help clear things up:

  1. Calibration is a property of a set of forecasts and corresponding ground truths. It can be evaluated on any dataset, whether from the train/val/test split (see the sketch after this list).

  2. In your definition of "calibration" I believe you're referring to "temperature scaling" or the like, which are methods to improve calibration on a test set using a val set. Our codebase does not implement temperature scaling or any other such method.

  3. It is not always necessary to use temperature scaling to improve calibration.
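To illustrate sense (1), here is a rough sketch of how calibration of a probabilistic regressor can be checked on any split, along the lines of what calibration_regression does: for each probability level p, compare the observed fraction of targets below the predicted p-quantile against p. The Normal parameterization and the function/variable names are illustrative, not ngboost's API.

```python
import numpy as np
from scipy.stats import norm

# Illustrative calibration check: for each level p, the fraction of targets
# below the predicted p-quantile should be close to p if the forecasts are
# calibrated. `mu`/`sigma` stand in for any predicted Normal parameters.
def empirical_coverage(Y, mu, sigma, levels=np.linspace(0.05, 0.95, 19)):
    Y = np.asarray(Y).reshape(-1)
    observed = []
    for p in levels:
        icdfs = norm.ppf(p, loc=mu, scale=sigma).reshape(-1)  # predicted p-quantiles
        observed.append(np.mean(Y < icdfs))                   # observed coverage
    return levels, np.array(observed)  # calibrated => observed ≈ levels
```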

astrogilda commented 4 years ago

@tonyduan Yeah, you nailed it, that's exactly what I was referring to. Though wouldn't calibrating on the validation set be the only way to get calibrated outputs on the test set in the wild?

Also could you expand on your last point please?

tonyduan commented 4 years ago

@astrogilda See, for example, [Guo et al. 2017]:

In 2005, Niculescu-Mizil & Caruana (2005) showed that neural networks typically produce well-calibrated probabilities on binary classification tasks. While neural networks today are undoubtedly more accurate than they were a decade ago, we discover with great surprise that modern neural networks are no longer well-calibrated.

In my experience, under-specified models (logistic regression, for example, and low-complexity neural networks, as supported by the above reference) tend to produce well-calibrated probabilities without the need for post-hoc calibration.

astrogilda commented 4 years ago

Ah, I see. Actually, for a paper I am writing right now, we are using ngboost for predictions, and we see a significant improvement in performance with post-hoc calibration; this should be on arXiv in less than a month, and I'd be happy to post the relevant plots here.