QuintenSand opened this issue 2 years ago
I came across the same issue and, while I have not gone through your example in detail, I suspect it is the same underlying problem. For (multiclass?) classification tasks, predictions are at some point generated in a structure where, for each sample/row, the probability of belonging to a given class is stored in a separate column whose name is the class level. So if you have class levels like "1" and "0", or other syntactically invalid strings, some magic happens under the hood that alters the class levels so that syntactically valid column names are produced. It is not as straightforward as make.names("1"), which would produce "X1", because this workaround sometimes fails. The only solution I have found so far is to replace any class levels with syntactically valid ones at the very beginning, but I am looking for an alternative solution as well. Maybe someone else knows one (e.g. on Stack Overflow).
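For anyone hitting this, here is a small illustration (my own sketch, not taken from any package internals) of what make.names() does to such class levels:

# Illustration only: how R sanitises syntactically invalid names.
make.names(c("1", "0"))       # returns "X1" "X0"
make.names("1 C-$_3.5")       # invalid characters become dots, leading digit gets an "X" prefix
# Prediction data.frames often carry such sanitised names as column names,
# so they no longer match the original factor levels passed to `class`.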
Should work now. There was a bug that ignored the passed predict.function for h2o binary classification models.
Btw., in such cases you can compare the result of e.g. predictor.glm$prediction.function(features) with predictor.glm$predict(features). If the latter does not return the same as the former, something went wrong. In your case, you used the wrong value for class (you could have fixed it with class = "p1", as that then selects the corresponding column from predictor.glm$predict(features)).
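As a rough sketch of that check (the object names predictor and features are placeholders for whatever you built your Predictor with):

# Placeholder sketch of the suggested comparison; `predictor` is an iml::Predictor
# and `features` the data it was built on (names are assumptions here).
manual   <- predictor$prediction.function(features)  # the function you passed in
internal <- predictor$predict(features)              # what iml actually uses
head(manual)
head(internal)
# If the two differ (e.g. different columns or column names), the value of
# `class` is likely selecting a wrong or non-existent column.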
Does the fix only alter the behaviour for H2O models, or will it also address the same problem I have observed with caret random forest models? Thanks!
Currently, the fix only applies to model objects of class H2OBinomialModel. However, if you have seen this problem for other models as well, I guess the problem will persist there. If you can provide minimal working examples like the ones above for the models where it did not work, I can have a look. Remember that even in your earlier example, you could have made things work by doing this:
predictor.gbm <- Predictor$new(
  model = gbm,
  data = features,
  y = response,
  predict.function = pred,
  class = "p1"
)
You can find more background and the workaround on Stack Overflow, but here is a minimal working example:
# ----- Packages -----
library(randomForest)
library(caret)
library(iml)

# ----- Dummy Data -----
One   <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
Two   <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
Four  <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
df <- cbind.data.frame(One, Two, Three, Four)

# ----- Modelling + IML for syntactically invalid levels from "Three" -----
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(One, Two, Four)
rf <- caret::train(TrainData, Three, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data = df, class = ALE.ClassOfInterest)
FE3 <- FeatureEffects$new(Pred, features = names(df), method = "ale")$results
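For what it's worth, here is a rough sketch of the workaround mentioned at the top of the thread (renaming the class levels to syntactically valid names before training, so the prediction columns match the levels); the exact code is my own and not from the thread:

# Workaround sketch (my own, untested against this exact setup): sanitise the
# response levels before training so prediction column names match the levels.
ThreeValid <- factor(make.names(as.character(Three)))
rf2 <- caret::train(TrainData, ThreeValid, method = "rf",
                    tuneLength = 3, trControl = trainControl(method = "cv"))
Pred2 <- Predictor$new(rf2, data = TrainData, y = ThreeValid,
                       class = make.names(ALE.ClassOfInterest))
FE2 <- FeatureEffects$new(Pred2, features = names(TrainData), method = "ale")$results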
I'm trying to get my head around why class = "p1" is a solution in the example provided by @QuintenSand. I don't have all of his libraries; is "p1" a (syntactically valid) level of one of his classes that ends up being a (syntactically valid) column name? In my case the classes have values that would result in syntactically invalid column names (at least that was my conclusion), but I could not figure out what iml does to avoid this. A simple make.names does not always seem to do the trick...
Hi, thanks for the example. There seem to be other issues as well. Regarding your valid question on why class = "p1" would work above:
1) The first issue is that the passed predict.function is ignored if a create_predict_fun exists for the model object. This is the case for h2o, caret, mlr3 etc.; see also https://github.com/giuseppec/iml/issues/134#issuecomment-671260937 . I guess I'll have to fix this first, e.g. by enforcing that a user-specified predict.function is used whenever it is passed.
2) The reason why class = "p1", and not one of the class levels of Attrition (which here are "0" or "1"), should be used (I agree that this is unintuitive) is an artifact of the h2o.predict() function, which produces a data.frame of predictions with columns p0 and p1; see e.g. the output below (a short sketch follows after it):
> h2o.predict(rf, as.h2o(features[1:10,]))
  predict        p0         p1
1       0 0.8181818 0.18181818
2       0 0.7272727 0.27272727
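To make point 2) concrete, here is a rough sketch (object names such as h2o_rf, features and response are placeholders standing in for the objects of the original example) of a custom predict.function that exposes exactly these p0/p1 columns, so that class = "p1" refers to a column that actually exists:

# Placeholder sketch: wrap h2o.predict() so that iml sees a plain data.frame
# with probability columns p0/p1; `class = "p1"` then selects one of them.
pred <- function(model, newdata) {
  res <- as.data.frame(h2o::h2o.predict(model, h2o::as.h2o(newdata)))
  res[, c("p0", "p1")]
}
predictor <- Predictor$new(model = h2o_rf, data = features, y = response,
                           predict.function = pred, class = "p1")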
Will reopen this issue
There may be an error in the tutorial of this package on this website: "Interpreting Machine Learning Models with the iml Package". If we want to check the FeatureImp of the models, the following error appears: "Error in `[.data.frame`(prediction, , self$class, drop = FALSE) : undefined columns selected", as mentioned in the title. Does anyone know why this error occurs? Here is a reproducible example:
Created on 2022-07-26 by the reprex package (v2.0.1)
Thank you in advance!