mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Calibration-in-the-large and calibration slope #842

Closed studerus closed 8 years ago

studerus commented 8 years ago

These two performance measures are essential, particularly in external validation studies. Several guidelines and textbooks on clinical prediction modelling recommend them, for example:

http://dx.doi.org/10.1093/eurheartj/ehu207
http://dx.doi.org/10.7326/M14-0698

Formulas for calculating these measures can be found in the textbook of Steyerberg: http://www.springer.com/us/book/9780387772431

They also seem to be calculated by the validate function in package rms. I'm not sure I could implement it myself, and I only have limited time. I would highly appreciate it if someone could implement this.
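
For reference, a rough sketch of how these two measures are usually computed (following Steyerberg); this is plain R, not mlr code, with toy p and y standing in for validation-set predicted probabilities and binary outcomes:

# Toy stand-ins for validation-set predicted probabilities and binary outcomes
set.seed(1)
p = runif(200, 0.05, 0.95)
y = rbinom(200, 1, p)

lp = qlogis(p)  # linear predictor (log odds)

# Calibration-in-the-large: intercept of a logistic model with the linear
# predictor as an offset (slope fixed at 1); the ideal value is 0.
cal.large = unname(coef(glm(y ~ offset(lp), family = binomial))[1])

# Calibration slope: coefficient of the linear predictor in a logistic
# regression; the ideal value is 1.
cal.slope = unname(coef(glm(y ~ lp, family = binomial))["lp"])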

berndbischl commented 8 years ago

Well, we all have limited time :) I don't think this is very high on our own priority list; there is so much else to do...

If you want this, could you at least try to do a PR that we can then support and check?

studerus commented 8 years ago

I realised that one can easily calculate these (and many other performance measures not yet implemented in mlr) with the function rms::val.prob. It just needs a vector of predicted probabilities and the observed outcomes. I will try to do a PR when I find the time.

berndbischl commented 8 years ago

@giuseppec Can you support this a bit here? I mean, you also know quite a lot about this, calibration-in-the-large and so on?

giuseppec commented 8 years ago

We already have a possibility to visualize the calibration slope, see http://mlr-org.github.io/mlr-tutorial/devel/html/classifier_calibration/index.html and https://github.com/mlr-org/mlr/issues/402 .

For "calibration-in-the-large" I am not sure how this measure will help people to assess the performance of the model. The measure compares the "averaged prediction probabilities" and the "proportion of positive class labels". In http://onlinelibrary.wiley.com/doi/10.1111/biom.12455/pdf , we agreed with Nancy Cook that "calibration-in-the-large is not sufficient to fully assess the calibration of a prediction model. Rather, it should be seen as the most basic requirement because many aspects of calibration may not be accounted for even if calibration-in-the-large is satisfied". However, I have a R-Package that can be used with mlr and allows you to at least visualize this measure

library(mlr)
library(RBPcurve)
# Train a probability-predicting learner on the breast cancer task
lrn = makeLearner("classif.randomForest", predict.type = "prob")
tr = train(lrn, bc.task)
pred = predict(tr, bc.task)
# Residual-based predictiveness (RBP) curve of the predicted probabilities
obj = makeRBPObj(getPredictionProbabilities(pred), getTaskTargets(bc.task))
plotRBPCurve(obj, cond.axis = TRUE, type = "b", col = 2)
addGoodCalib(obj)
# If the values of both integrals are equal, then the
# "averaged prediction probabilities" and the "proportion of positive class labels" are equal.

Is that enough for you or do you need a mlr measure for this?

studerus commented 8 years ago

Thanks. I think it would still be useful to have all the measures from the function val.prob of package rms, which are described as follows:

Given a set of predicted probabilities p or predicted log odds logit, and a vector of binary outcomes y that were not used in developing the predictions p or logit, val.prob computes the following indexes and statistics: Somers' D{xy} rank correlation between p and y [2(C-.5), C=ROC area], Nagelkerke-Cox-Snell-Maddala-Magee R-squared index, Discrimination index D [ (Logistic model L.R. chi-square - 1)/n], L.R. chi-square, its P-value, Unreliability index U, chi-square with 2 d.f. for testing unreliability (H0: intercept=0, slope=1), its P-value, the quality index Q, Brier score (average squared difference in p and y), Intercept, and Slope, E{max}=maximum absolute difference in predicted and calibrated probabilities, the Spiegelhalter Z-test for calibration accuracy, and its two-tailed P-value.

It would be very easy to implement these measures by just programming a wrapper around val.prob. The function was written by Frank Harrell, who, as you probably all know, is one of the leading experts in clinical prediction modelling. If he considers these measures useful, then probably many other people in the field would as well.
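
As a rough sketch of what such a wrapper could look like (the helper name is made up, and the exact element names returned by rms::val.prob may differ between rms versions):

# Hypothetical helper, not part of mlr or rms: extract one named statistic
# from the vector returned by rms::val.prob.
valProbIndex = function(probabilities, truth, positive, index) {
  y = as.numeric(truth == positive)
  rms::val.prob(probabilities, y, pl = FALSE)[[index]]
}

# For example, the calibration intercept ("calibration-in-the-large") and the Brier score:
# valProbIndex(getPredictionProbabilities(pred), pred$data$truth, pred$task.desc$positive, "Intercept")
# valProbIndex(getPredictionProbabilities(pred), pred$data$truth, pred$task.desc$positive, "Brier")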

berndbischl commented 8 years ago

I also don't think that we are making the package worse by wrapping calls to these measures?

Especially as mlr can be used to do multi-criteria evaluation and even tuning?
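
I mean something like this already works with a list of probability-based measures, and new val.prob-based measures would simply be added to that list (rough sketch):

library(mlr)
# Sketch: evaluate several probability-based measures at once; wrapped
# calibration measures would just be appended to this list.
lrn = makeLearner("classif.logreg", predict.type = "prob")
res = resample(lrn, bc.task, cv5, measures = list(auc, brier, logloss))
res$aggr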

berndbischl commented 8 years ago

So we would simply create a bunch of new measures that operate on binary labels and predicted probabilities?

studerus commented 8 years ago

I created the following wrapper for the calibration slope.

#' @export slope
#' @rdname measures
#' @format none
slope = makeMeasure(id = "slope", minimize = FALSE, best = 1, worst = 0,
  properties = c("classif", "req.pred", "req.truth", "req.prob"),
  name = "Calibration slope",
  note = "See `?rms::val.prob`.",
  fun = function(task, model, pred, feats, extra.args) {
    measureSlope(getPredictionProbabilities(pred), pred$data$truth, pred$task.desc$positive)
  }
)

#' @export measureSlope
#' @rdname measures
#' @format none
measureSlope = function(probabilities, truth, positive) {
  y = as.numeric(truth == positive)
  rms::val.prob(probabilities, y, pl = FALSE)[["Slope"]]
}

The question is how to define the arguments minimize, best and worst. The best value is 1, but the slope can become both bigger and smaller than 1.

giuseppec commented 8 years ago

Hm, good question. Would you say that a slope of 1.1 is as good as a slope of 0.9? I mean, the slope could also be negative, right? The only possibility I currently see is to transform the slope so that it fits into a [best, worst] interval, like abs(1 - slope) or something.
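
Roughly like this, building on your measureSlope wrapper above (just a sketch):

# Sketch: absolute deviation of the calibration slope from its ideal value 1,
# so that smaller is better and 0 is perfect.
slope.dev = makeMeasure(id = "slope.dev", minimize = TRUE, best = 0, worst = Inf,
  properties = c("classif", "req.pred", "req.truth", "req.prob"),
  name = "Absolute deviation of the calibration slope from 1",
  note = "abs(1 - slope), see `?rms::val.prob`.",
  fun = function(task, model, pred, feats, extra.args) {
    abs(1 - measureSlope(getPredictionProbabilities(pred), pred$data$truth,
      pred$task.desc$positive))
  }
)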

berndbischl commented 8 years ago

How is the calibration slope usually handled? I mean, it seems that one does not "optimize" this? Do you simply "observe" it?

giuseppec commented 8 years ago

I have never seen this being optimized. What people usually do is observe and "recalibrate" the predictions: (if I remember correctly) whenever the slope is not 1, one can multiply the current predictions by a constant (or apply a linear transformation) so that the slope gets closer to 1.
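
If I remember the procedure right, it is basically a logistic recalibration on the log-odds scale, roughly like this (plain R sketch with toy data, not mlr code):

# Toy stand-ins: deliberately miscalibrated predicted probabilities p and outcomes y
set.seed(1)
p = runif(500, 0.05, 0.95)
y = rbinom(500, 1, plogis(2 * qlogis(p)))

lp = qlogis(p)
recal = glm(y ~ lp, family = binomial)      # estimates a new intercept and slope
p.recal = predict(recal, type = "response") # recalibrated predictions

# The slope of the recalibrated predictions is (by construction) 1 on these data:
coef(glm(y ~ qlogis(p.recal), family = binomial))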

berndbischl commented 8 years ago

This then does not fit perfectly well into mlr's measure system. What we could then do is allow minimize = NA for such "observe-only" measures.

This requires touching code in a couple of places! Note that this only sounds simple; we would now be introducing an annoying "special case".

Isn't there any other way out of this? Because currently I would rather not do this...

berndbischl commented 8 years ago

Well, there is the function checkMeasures. We COULD allow these NAs as outlined above, and then throw an error in checkMeasures if the first measure has minimize = NA.

If somebody is willing to properly look at the codebase to do this and add tests, we can discuss this further.
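
Just to sketch the idea (hypothetical, not the actual mlr internals):

# Hypothetical sketch: an "observe-only" measure could carry minimize = NA,
# and a check like this could refuse such a measure in the first position,
# since that slot drives optimization.
assertFirstMeasureOptimizable = function(measures) {
  if (is.na(measures[[1]]$minimize))
    stop("The first measure is used for optimization and must have minimize = TRUE or FALSE.")
  invisible(measures)
}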

berndbischl commented 8 years ago

Disregard what I said about checkMeasures. This is not so simple.

bhvieira commented 8 years ago

Platt scaling (logistic regression) and isotonic regression (or actually any kind of bounded regression, though I've never seen any other than those two) could be used to improve the calibration. Check the docs in scikit-learn: Probability calibration.

Also, there's a measure I know is associated with calibration plots: Kolmogorov-Smirnov distance.
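
For instance, one common way to compute it for scores is the maximum gap between the empirical CDFs of the scores in the two classes; a small self-contained sketch on toy data (base R only):

# Sketch: two-sample Kolmogorov-Smirnov distance between the score
# distributions of the two classes (toy scores x, binary labels y).
set.seed(1)
y = sample(0:1, 1000, replace = TRUE)
x = ifelse(y == 1, rnorm(1000, mean = 1), rnorm(1000, mean = -1))
ks.test(x[y == 1], x[y == 0])$statistic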

EDIT: Reading stackexchange today I came across this: Predicting Good Probabilities With Supervised Learning. Good reading :)

EDIT2: In case anyone stumbles upon this, here's an example I wrote with random data representing scores (these could be probability estimates) of labeled data.

# Change N to see how the fit changes
N = 1e4
y = sample(0:1, N, replace = TRUE)
x = y
# Try other distributions
x[x == 0] = rnorm(length(x[x == 0]), mean = -1)
# If both classes are normally distributed, both solutions work almost the same as N increases, check it:
# x[x == 1] = rnorm(length(x[x == 1]), mean = 1)
# Let's try a different one
x[x == 1] = runif(length(x[x == 1]), min = 0, max = 4)

# Isotonic regression of the labels on the scores
y.iso = isoreg(x, y)$yf

# Platt scaling: logistic regression (with intercept) of the rescaled labels on the scores
y.resc = (y - min(y)) / (max(y) - min(y))
y.log = glm(y.resc ~ x, family = binomial(link = "logit"))$fitted.values
y.log = y.log * (max(y) - min(y)) + min(y)

# Empirical CDFs of the raw labels and the calibrated scores
P = ecdf(y)
P.iso = ecdf(y.iso)
P.log = ecdf(y.log)

plot(x, jitter(y, 0.01), cex = 0.5, pch = 16, col = adjustcolor("black", alpha.f = 0.2),
  ylab = "Class", xlab = "Score")
lines(sort(x), y.iso, lwd = 2, col = 2)
lines(x[order(y.log)], y.log[order(y.log)], lwd = 2, col = 3)
legend("topleft", legend = c("isotonic regression", "logistic regression"), lwd = 2, col = 2:3)