mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Interpretation of predictions with xgboost #2395

Closed notiv closed 5 years ago

notiv commented 5 years ago

I gave it a try on Stack Overflow and the suggestion was to try here instead:

The latest version of xgboost (0.7) allows for the interpretation of predictions by setting the `predcontrib` parameter of the `predict` function to `TRUE`. This works well when using the xgboost package directly, but not when using xgboost within mlr (with wrappers, CV, etc.).

Long story short: Is there a way (a work-around would suffice) to pass this parameter to the predict function of the learner and then return both the predictions and the contributions of each feature to each single score?

Or otherwise: Is there a way to unwrap the final, tuned model and "at the right moment" use the predict function of xgboost with the predcontrib parameter? This doesn't need to be in the predictLearner of a possibly modified xgboost learner, but could be done in a completely separate function.

P.S. I can imagine that this use case will become more prevalent in the future with the development of packages like lime.
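For context, here is a minimal sketch of what works in plain xgboost (the dataset and variable names are illustrative, not from the question):

```r
library(xgboost)

# Illustrative setup: a small binary classification booster
X <- as.matrix(mtcars[, -9])   # features
y <- mtcars$am                 # 0/1 label
bst <- xgboost(data = X, label = y, nrounds = 5,
               objective = "binary:logistic", verbose = 0)

# Standard predictions: one probability per observation
p <- predict(bst, X)

# Per-feature contributions (SHAP values): one row per observation,
# one column per feature plus a final BIAS column
contrib <- predict(bst, X, predcontrib = TRUE)
# dim(contrib) is nrow(X) x (ncol(X) + 1)
```

The question is how to get the second call's output back through mlr's `predict` machinery.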

larskotthoff commented 5 years ago

Hmm, it sounds like you would need to change the xgboost integration to make that happen, i.e. have a custom learner.

notiv commented 5 years ago

Hi Lars, thanks for the quick response.

I did that, but encountered two issues (you can see the details in the SO post):

a) The parameters are not being passed to the `predictLearner.classif.xgboost.c` function of the custom learner.
b) `checkPredictLearnerOutput` does not allow `predictLearner.classif.xgboost.c` to return anything other than the standard output structure [0,1,p]. In this case I would need to return a matrix or data.frame.

Any hints on how I could work around these issues (the second is, of course, the more important one)? Are there any learners that provide similar functionality, so that I can look at how to modify my code?

larskotthoff commented 5 years ago

a) You need to set predcontrib when you're creating the learner. b) What exactly does predcontrib do? It sounds like getFeatureImportanceLearner() would be a more suitable place for it.
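A hedged sketch of what a) could look like in a custom learner: mlr only forwards parameters it knows about, so the custom learner would declare `predcontrib` as a predict-stage parameter. The learner id and the `mlr:::` access to the built-in constructor are illustrative assumptions, not confirmed mlr API for this use case:

```r
library(mlr)

# Hypothetical custom learner: start from the built-in xgboost learner
# and declare 'predcontrib' so mlr passes it through at predict time.
makeRLearner.classif.xgboost.c <- function() {
  lrn <- mlr:::makeRLearner.classif.xgboost()  # internal constructor, hence :::
  lrn$par.set <- c(
    lrn$par.set,
    makeParamSet(
      makeLogicalLearnerParam("predcontrib", default = FALSE,
                              when = "predict")
    )
  )
  lrn$id <- "classif.xgboost.c"
  lrn
}
```

Even with the parameter passed through, issue b) remains: mlr's output check still expects the standard prediction structure.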

notiv commented 5 years ago

a) Ok, I get that. I thought having the ellipsis (`...`) in the call would suffice.

b) Let me elaborate. The usual call, also in the default xgboost learner, is `p = predict(m, newdata = data.matrix(.newdata), ...)`. If we want to calculate the contribution of each feature to the score of each single observation, we call the function as `contrib = predict(m, newdata = data.matrix(.newdata), predcontrib = TRUE, ...)`. Here `contrib` is a large matrix that contains the individual contribution of each feature for each scored observation, i.e. if we score 100 observations and have 20 features, the matrix will have dimension 100x20.
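The two calls described above, written out (this is a fragment from inside a `predictLearner` method, not standalone code; note that xgboost actually appends a BIAS column, so the contribution matrix has one more column than there are features):

```r
# Standard predict call, as in the default classif.xgboost learner:
p <- predict(m, newdata = data.matrix(.newdata), ...)

# With per-observation feature contributions:
contrib <- predict(m, newdata = data.matrix(.newdata), predcontrib = TRUE, ...)
# contrib: one row per scored observation, one column per feature
# plus a trailing BIAS column (100 obs x 20 features -> 100 x 21)
```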

larskotthoff commented 5 years ago

mlr doesn't currently have an interface for b), so this would be a major change that would touch much more than just the learner.

notiv commented 5 years ago

I was afraid so, but didn't want to give up hope... :-)

Is there a way to perform all the "wrapper" tasks without executing the predict step, and then run a "custom" predict with predcontrib = TRUE? Do you think mlrCPO is mature enough to replace my wrapper-within-a-wrapper preprocessing? Or is there any other workaround or idea?

larskotthoff commented 5 years ago

I don't think that there are any easy workarounds here -- although you could extract the model after mlr is done and work with that directly.

notiv commented 5 years ago

Ok, thanks Lars!

berndbischl commented 5 years ago

Getting the `predcontrib` param down to the xgboost predict function is simple, and mlr allows this. The problem is returning the fine-grained information that xgboost then produces; that does not work.

We have written iml for model-agnostic interpretations, did you look into that? https://github.com/christophM/iml
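A hedged sketch of the iml route (dataset and variable names are illustrative; the `predict.function` argument name may differ across iml versions):

```r
library(iml)
library(xgboost)

# Illustrative model: small xgboost booster on mtcars
X <- as.matrix(mtcars[, -9])
y <- mtcars$am
bst <- xgboost(data = X, label = y, nrounds = 5,
               objective = "binary:logistic", verbose = 0)

# Wrap the model so iml can query it; the custom predict function
# converts iml's data.frame back to the matrix xgboost expects
predictor <- Predictor$new(
  model = bst, data = as.data.frame(X), y = y,
  predict.function = function(model, newdata) predict(model, as.matrix(newdata))
)

# Model-agnostic Shapley decomposition for one observation of interest
shap <- Shapley$new(predictor, x.interest = as.data.frame(X)[1, ])
shap$results  # per-feature contributions (phi) for that observation
```

Unlike `predcontrib = TRUE`, this works on the wrapped mlr model's predict function too, since iml only needs a predict callable.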

notiv commented 5 years ago

I'll check it out, thanks for the hint Bernd! (xgboost implements Shapley values.)

notiv commented 5 years ago

Just in case someone tries to solve the same problem:

As a hint, once we have the wrapped model, we can do the following (using mlrCPO for the preprocessing retrafo):

```r
# Unwrap the raw xgboost model from the (possibly nested) mlr wrappers
xgb_unwrapped <- mlr::getLearnerModel(wrapped_model, more.unwrap = TRUE)

# Re-apply the preprocessing the wrapped model learned (mlrCPO retrafo)
data_after_preproc <- raw_data %>>% retrafo(wrapped_model)

# Call xgboost's own predict directly, with per-feature contributions
predictions_w_contributions <- predict(xgb_unwrapped, data_after_preproc, predcontrib = TRUE)
```