giuseppec / iml

iml: interpretable machine learning R package
https://giuseppec.github.io/iml/
Other
491 stars 88 forks source link

Bug for FeatureEffects with method = "pdp" and categorical features #133

Open giuseppec opened 4 years ago

giuseppec commented 4 years ago

Minimal Example (see comments for the bug)

library("mlr3")
library("mlr3learners")
library("iml")

credit.task = tsk("german_credit")
lrn = lrn("classif.ranger", predict_type = "prob")
model = lrn$train(credit.task)
m = Predictor$new(model, data = credit.task$data(), y = "credit_risk")

# BUG for method = "pdp" for categorical features:
pdp = FeatureEffects$new(m, feature = c("status", "duration"), method = "pdp")
plot(pdp)

# this works:
pdp = FeatureEffect$new(m, feature = "status", method = "pdp")
plot(pdp)

# this also works:
pdp = FeatureEffects$new(m, feature = c("status", "duration"), method = "pdp+ice")
plot(pdp)
giuseppec commented 4 years ago

The bug is an empty plot for the pdp of categorical feature (while it works for the FeatureEffect method) image

pat-s commented 4 years ago

Thanks!

The issue is that after calculating PDP, integer features are numeric. Then a prediction to newdata with the changed feature is tried. However, the model was trained on an integer feature and should now predict to a numeric one and fails.

Internally, get.grid() causes the type transformation here

https://github.com/christophM/iml/blob/dcabe538715588c5f2c4de50e2eb543c335d9a9b/R/FeatureEffect.R#L339-L342

and more specifically in get.grid.1D:

https://github.com/christophM/iml/blob/dcabe538715588c5f2c4de50e2eb543c335d9a9b/R/utils.R#L186-L195

Here, get.grid() used type = "equidist" in every case and tries to create a sequence of equally spaced values. This does not work with integers.

I'd argue that in this case we should treat integers like factors since we quite often cannot create a sequence of integers given the "large" grid.sizes (with 20 as the default) for integer variables.

Speaking of the current state, calculation of PDP is not possible and AFAICS and I am wondering that this did not come up earlier.

@christophM What do you think? Should integers be treated as nominal features here?


Fun fact: In ?seq I see:

seq.int and the default method of seq for numeric arguments return a vector of type "integer" or "double": programmers should not rely on which.

christophM commented 4 years ago

Integers are a bit tricky, because when you have very few values, treating them as factors might be desirable, but if you have some more values numerical might be more desirable. This concerns just the visualization.

Maybe having a special case for integers in get.grid would solve the issue. For integer features with < grid.size values we take all unique values. If type="equidistant" and the unique values count > grid.size, it's a bit more difficult. Maybe we compute the grid and round to the next integers?

bappa10085 commented 1 year ago

I am also facing the same issue. Whether there is a fix for this problem as on date?