saattrupdan / doubt

Bringing back uncertainty to machine learning.
MIT License

Risk score uncertainty (="prediction intervals" in classification) #37

Closed: e-pet closed this issue 2 years ago

e-pet commented 2 years ago

Hi Dan, first of all: thanks for the excellent library! This is amazing functionality and it is a pity that it is not available in the standard libraries. Do you have any idea whether some work is going on to change that? Have you investigated getting your implementations into, say, sklearn?

I am currently trying to find a way to quantify risk score uncertainty. As a very simple example, consider binary classification: when I perform, e.g., logistic regression, the model predicts the probability of the positive outcome. I am interested in quantifying the uncertainty of that probability estimate. I would call this a prediction interval with respect to the probability regression problem that, e.g., logistic regression is solving. (I know that prediction sets for classification are also a thing, but that is not what I'm after.)

In theory, I feel like exactly the same methods you use for the regression setting should be applicable: this is just a regression problem, after all! However, I have not been able to get things to work satisfactorily. I tried two different approaches:

  1. Logistic Regression with doubt.Boot(.). I basically just patched the LogisticRegression object by overloading its predict method (which normally returns the predicted class) to point to its predict_proba method instead, and passed that into doubt.Boot (see the sketch after this list).
  2. Logistic Regression with doubt.QuantileRegressor. For this to work, I made some very minor changes to the QuantileRegressor.fit method (mostly making it use the correct inverse link function for logistic regression).
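
For reference, approach 1 boils down to something like the following sketch (placeholder data and names, not the exact code from my experiments):

# Rough sketch of approach 1 (placeholder data; not the exact experiment code)
from doubt import Boot
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class ProbabilisticLogisticRegression(LogisticRegression):
    """LogisticRegression whose predict() returns P(y=1|x) instead of the class label."""

    def predict(self, X):
        return self.predict_proba(X)[:, 1]


X, y = make_classification(n_samples=1000, random_state=42)

# Wrap the patched classifier in doubt's bootstrap wrapper
model = Boot(ProbabilisticLogisticRegression())
model.fit(X, y)

# Predicted probabilities plus (far too wide) 95% "prediction intervals"
probs, intervals = model.predict(X, uncertainty=0.05)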

In both cases, the fitting worked fine but the prediction intervals were way too large. In the first case, they spanned more than the whole [0, 1] interval (which obviously doesn't make any sense, since logistic regression cannot possibly return values outside that range); in the second case, they more or less covered exactly the whole [0, 1] interval. I also played around with trees instead of logistic regression, with similar results.

You can find the experiments I did here and play around with them yourself if you like.

Am I doing anything wrong? Or does what I am trying to do fundamentally not make sense for some reason? Do you have any ideas how I could get this to work?

Also see my related (more theoretical) question on stats.stackexchange if interested.

e-pet commented 2 years ago

Someone over at MAPIE pointed me to these two papers (1, 2), which seem to indicate that distribution-free confidence intervals in the binary regression/classification case must necessarily always "be wide". I'll be honest, the mathematical details are a bit over my head, and I am unsure whether these papers make strong assumptions that do not hold in my case, but maybe this is just what I am observing? I.e., are the results I see correct, and is it impossible to do better? I am completely lacking any intuition as to why that would be the case, though.

saattrupdan commented 2 years ago

Hi @e-pet, and thanks for all the kind words!

I'd love to set this library up for being included in the scikit-learn framework (it almost is already). As always with open source projects, time has been the missing ingredient! But it's definitely something which is in the pipeline.

Classification prediction intervals have been on my mind as well (see e.g. this issue) but, just like you, I've been unsure which method to employ.

The intuitive reason we cannot simply copy the regression prediction interval approach is the discrete labels in classification. In regression, a 95% prediction interval is an interval that contains the true value 95% of the time. As (binary) classification labels are either 0 or 1, a prediction interval around a probability would thus have to contain the true label (0 or 1) 95% of the time, resulting in incredibly wide intervals.
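
To make this concrete, here's a tiny toy simulation (purely illustrative; nothing to do with doubt's internals): even when the true risk is known exactly, an interval that stays strictly inside (0, 1) essentially never contains a binary label, so 95% coverage forces the interval out towards 0 and 1.

# Toy illustration: coverage of intervals around a known true risk of 0.7
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(n=1, p=0.7, size=100_000)  # binary labels, true risk 0.7

# A tight interval around the true risk never contains a 0/1 label
print(np.mean((y >= 0.6) & (y <= 0.8)))  # 0.0

# Reaching up to 1 only covers the positive labels (~70%)
print(np.mean((y >= 0.05) & (y <= 1.0)))  # ~0.7

# Covering 95% of the labels requires the interval to contain both 0 and 1
print(np.mean((y >= 0.0) & (y <= 1.0)))  # 1.0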

One way to get some notion of prediction interval around a probability, which doesn't require the intervals to include 0 or 1, is to only consider what I call the "model variance noise": the noise created by the model when it is trained on slightly different training datasets (see e.g. my blog post on this). This noise is independent of the labels and would thus yield narrower intervals. However, note that they won't always be accurate: a model which always outputs a probability of 0 will have prediction intervals of length 0, for instance, no matter how wrong those predictions are.

Note that the intervals might exceed the [0, 1] range, so you may have to truncate them.

If you want to try out the "model variance noise intervals", then a hack could be implemented as follows:

# Assumed imports for this sketch
import numpy as np
from doubt import Boot
from sklearn.linear_model import LinearRegression

# Initialise and fit the model to the training data
# (X_train, y_train and X_test are assumed to be defined already)
model = Boot(LinearRegression())
model.fit(X_train, y_train)

# Remove the residual information, reducing the uncertainty estimation
# to the "model variance noise" alone
model.residuals = []

# Perform predictions; uncertainty=0.05 requests 95% intervals
predictions, intervals = model.predict(X_test, uncertainty=0.05)

# Truncate the intervals to [0, 1]
if len(intervals.shape) == 1:
    # Single test sample: intervals is a (lower, upper) pair
    intervals[0] = max(0, intervals[0])
    intervals[1] = min(1, intervals[1])
else:
    # Multiple test samples: one (lower, upper) row per sample
    intervals[:, 0] = np.maximum(0, intervals[:, 0])
    intervals[:, 1] = np.minimum(1, intervals[:, 1])

If this version of the intervals is useful to you, then I could include an argument in the Boot class that skips computing these residuals and truncates the intervals to the [0, 1] range 🙂

e-pet commented 2 years ago

Hi Dan, thanks a lot for the comprehensive reply! And also for the pointer to the Vovk paper; I am learning many new things about calibration theory these days... ;-)

I understand the issue with the prediction sets having to contain 0/1 most of the time, but that is not what I'm after. I am really thinking about the probability regression problem that is solved by, e.g., logistic regression: p(y=1|x) = f(x) + eps. That is a "normal" continuous regression problem, with outcome values in [0, 1], and thus standard PI methods for the regression setting should generally apply, right? (It is, of course, not really a "normal" regression problem because we only have binary observations that are related to p(y|x) via a very weird noise distribution.)

While your hack unfortunately does not solve my problem (I am specifically also interested in aleatoric/sample noise), it does explain why my first approach above using Boot(.) did not work: the bootstrapping procedure you implement makes use of the residuals, and these are not meaningful in the logistic regression case. They represent the difference to the observed binary labels, not the difference to the true (unobservable) risk p(y=1|x), which is what one would need. So at least that particular bootstrapping method does not seem to be applicable to this setting.
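
Just to make that concrete, here's a small synthetic example (not my actual experiments): even a well-specified, well-calibrated logistic model has residuals with respect to the binary labels that are comparable in size to the whole [0, 1] range, so any scheme that resamples those residuals will produce intervals covering most of that range.

# Toy example: residuals w.r.t. binary labels are huge even for a good model
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))
p_true = 1 / (1 + np.exp(-2 * X[:, 0]))  # true risks
y = rng.binomial(1, p_true)

clf = LogisticRegression().fit(X, y)
p_hat = clf.predict_proba(X)[:, 1]

residuals = y - p_hat  # differences to the observed 0/1 labels
print(np.quantile(residuals, [0.025, 0.975]))  # wide, even though the model is essentially correct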

I apologize for the somewhat weird question; maybe what I am looking for is just not possible? I am currently exploring alternative Bayesian methods to do something like this, but I would generally prefer to use a model-agnostic method if possible. I'll close this issue for now but would be very happy if you had any further comments/suggestions/ideas, of course. :-)

e-pet commented 2 years ago

Hi @saattrupdan, just a quick update for future reference: sadly, it seems that what I wanted to do is fundamentally impossible. Maybe this was already apparent to you; it wasn't to me. :-)

Basically, the problem is that the marginal outcome distribution p(y) will look identical regardless of whether all subjects have a true risk of exactly 0.5 (= low/no aleatoric uncertainty) or whether the true risks are spread out widely with an average of 0.5 (= high aleatoric uncertainty). So, given only the outcome distribution, there is simply no way of distinguishing cases with more or less aleatoric uncertainty. This is a specific problem with the Bernoulli distribution, which is why it does not arise in the usual regression setting: there, the outcome distribution actually varies as a function of the aleatoric noise level.
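
A quick simulation (purely illustrative) of that point: the two scenarios below have completely different aleatoric uncertainty, yet produce indistinguishable label samples.

# Toy illustration: two very different aleatoric-uncertainty scenarios
# produce the same observed label distribution
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Scenario A: every subject has a true risk of exactly 0.5
y_a = rng.binomial(1, 0.5, size=n)

# Scenario B: true risks spread uniformly over [0, 1], still averaging 0.5
risks_b = rng.uniform(0, 1, size=n)
y_b = rng.binomial(1, risks_b)

# Both label samples look the same: ~50% positives in each
print(y_a.mean(), y_b.mean())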