giuseppec / iml

iml: interpretable machine learning R package
https://giuseppec.github.io/iml/

Error in `[.data.frame`(prediction, , self$class, drop = FALSE) : undefined columns selected #195

Open QuintenSand opened 2 years ago

QuintenSand commented 2 years ago

There may be an error in the tutorial for this package on this website: "Interpreting Machine Learning Models with the iml Package". If we try to compute the FeatureImp of the models, the following error appears: "Error in [.data.frame(prediction, , self$class, drop = FALSE) : undefined columns selected", as mentioned in the title. Does anyone know why this error occurs? Here is a reproducible example:

library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation

h2o.no_progress()
h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         8 hours 33 minutes 
#>     H2O cluster timezone:       Europe/Amsterdam 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.36.1.2 
#>     H2O cluster version age:    2 months  
#>     H2O cluster name:           H2O_started_from_R_quinten_xns163 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   3.64 GB 
#>     H2O cluster total cores:    8 
#>     H2O cluster allowed cores:  8 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.1.0 (2021-05-18)

#data
library(modeldata)
data("attrition", package = "modeldata")
#classification data
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames = c("train","valid","test"))
names(splits) <- c("train","valid","test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y) 

# elastic net model 
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
)

# random forest model
rf <- h2o.randomForest(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
)
#> Warning in .h2o.processResponseWarnings(res): early stopping is enabled but neither score_tree_interval or score_each_iteration are defined. Early stopping will not be reproducible!.

# gradient boosting machine model
gbm <-  h2o.gbm(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123
)
#> Warning in .h2o.processResponseWarnings(res): early stopping is enabled but neither score_tree_interval or score_each_iteration are defined. Early stopping will not be reproducible!.

# model performance
h2o.auc(glm, valid = TRUE)
#> [1] 0.7870935
h2o.auc(rf, valid = TRUE)
#> [1] 0.7681021
h2o.auc(gbm, valid = TRUE)
#> [1] 0.7468242

# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)

# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition)) 

# 3. Create custom predict function that returns the predicted values as a
#    vector (probability of purchasing in our example)
pred <- function(model, newdata)  {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}

# example of prediction output
pred(rf, features) %>% head()
#> [1] 0.18181818 0.27272727 0.06060606 0.54545455 0.03030303 0.42424242

# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response, 
  predict.function = pred,
  class = "classification"
)

predictor.rf <- Predictor$new(
  model = rf, 
  data = features, 
  y = response, 
  predict.function = pred,
  class = "classification"
)

predictor.gbm <- Predictor$new(
  model = gbm, 
  data = features, 
  y = response, 
  predict.function = pred,
  class = "classification"
)

#compute feature importance with specified loss metric
imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
#> Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected
imp.rf <- FeatureImp$new(predictor.rf, loss = "mse")
#> Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected
imp.gbm <- FeatureImp$new(predictor.gbm, loss = "mse")
#> Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected

Created on 2022-07-26 by the reprex package (v2.0.1)

Thank you in advance!

Mark-iGit commented 8 months ago

I came across the same issue and, while I have not gone through your example in detail, I suspect it is the same underlying problem. For (multiclass?) classification tasks, at some point predictions are generated in a structure where, for each sample/row, the probability of belonging to a given class is stored in a distinct column whose name is the class level. So if you have class levels like "1" and "0", or other syntactically invalid strings, there seems to be some magic going on under the hood that alters the class levels so that syntactically valid column names are generated. It is not as straightforward as make.names("1"), which would produce "X1", because this workaround sometimes fails. The only solution I have so far is to replace any class levels with syntactically valid ones at the very beginning, but I'm looking for an alternative solution as well. Maybe someone on Stack Overflow knows one.
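As a quick illustration of the renaming behaviour described above, this minimal base-R sketch shows why a class level like "1" cannot survive as a column name unchanged:

```r
# Non-syntactic factor levels such as "1" or "0" are not valid R column
# names, so prediction frameworks often pass them through make.names()
# when building the prediction data.frame:
make.names(c("1", "0"))
#> [1] "X1" "X0"

# If iml then looks up the original level "1" as a column name, the
# subset `prediction[, "1", drop = FALSE]` has no matching column and
# fails with "undefined columns selected".
```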

giuseppec commented 7 months ago

Should work now. There was a bug that ignored the passed predict.function for h2o binary classification models.

Btw. in such cases, you can compare the result of e.g. predictor.glm$prediction.function(features) and predictor.glm$predict(features). If the latter does not return the same as the former, something went wrong. In your case, you used the wrong value for class (you could have fixed it using class = "p1" as it then selects the corresponding column from predictor.glm$predict(features)).
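That sanity check can be sketched as follows (a minimal sketch that assumes the `glm`, `features`, `response`, and `pred` objects from the reprex above are still in the session):

```r
# Compare the raw user-supplied predict function against what the
# Predictor actually uses internally; they should agree.
raw  <- predictor.glm$prediction.function(features)
used <- predictor.glm$predict(features)
head(raw)
head(used)

# If they disagree, inspect the column names of the internal prediction
# and pass the matching one as `class`. For h2o binary models that is
# "p1" (the column name produced by h2o.predict()), not the factor
# level "1":
predictor.glm.fixed <- Predictor$new(
  model = glm,
  data = features,
  y = response,
  predict.function = pred,
  class = "p1"
)
```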

Mark-iGit commented 7 months ago

Does the fix only alter the behaviour for H2O models, or will it also address the same problem I have observed with caret random forest models? Thanks!

giuseppec commented 7 months ago

Currently, the fix should only apply to model objects of class H2OBinomialModel. However, since you say you have seen this problem for other models as well, I guess the problem will persist there. If you can provide minimal working examples like the ones above for the models where it did not work, I can have a look. Remember that even in your prior example, you could have made things work if you had done this:

predictor.gbm <- Predictor$new(
  model = gbm, 
  data = features, 
  y = response, 
  predict.function = pred,
  class = "p1"
)

Mark-iGit commented 7 months ago

You can find more background and the workaround on Stack Overflow, but here is the minimal working example:

# ----- Packages -----
library(randomForest)
library(caret)
library(iml)

# ----- Dummy Data -----
One <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
Two <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
Four <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
df <- cbind.data.frame(One, Two, Three, Four)

# ----- Modelling + IML for syntactically invalid levels from "Three" -----
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(One, Two, Four)
rf <- caret::train(TrainData, Three, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE3 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results

I'm trying to get my head around why class = "p1" is a solution in the example provided by @QuintenSand. I don't have all of his libraries; is that a (syntactically correct) level of one of his classes that ends up being a (syntactically correct) column name? In my case the classes have values that would result in syntactically incorrect column names (at least that was my conclusion), but I could not figure out what iml does to avoid this. A simple make.names does not always seem to do the trick...
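The level-renaming workaround mentioned above can be sketched like this, reusing `One`, `Two`, `Three`, `Four`, `TrainData`, `df`, and `ALE.ClassOfInterest` from the example (`ThreeValid`, `rf.valid`, and `Pred.valid` are hypothetical names):

```r
# Make the response levels syntactically valid *before* training, so the
# probability columns produced later match the factor levels exactly.
ThreeValid <- factor(make.names(as.character(Three)))
levels(ThreeValid)  # syntactically valid versions of the original levels

rf.valid <- caret::train(TrainData, ThreeValid, method = "rf",
                         tuneLength = 3,
                         trControl = trainControl(method = "cv"))

# The class of interest can now be referenced by its (valid) level name:
Pred.valid <- Predictor$new(rf.valid, data = df,
                            class = make.names(ALE.ClassOfInterest))
```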

giuseppec commented 7 months ago

Hi, thanks for the example. There seem to be other issues as well. Regarding your valid question on why class = "p1" would work above:

1. The first issue is that the passed predict.function is ignored if there exists a create_predict_fun for the model object. This is the case for h2o, caret, mlr3, etc., see also https://github.com/giuseppec/iml/issues/134#issuecomment-671260937 . I guess I'll have to fix this first, e.g. by enforcing that a user-specified predict.function is used whenever it is passed.
2. The reason why class = "p1" and not the class levels of Attrition (which are here "0" or "1") should be used (I agree that this is unintuitive) is an artifact of the h2o.predict() function, which produces a data.frame of predictions with columns p0 and p1, see e.g.:

> h2o.predict(rf, as.h2o(features[1:10,]))
  predict        p0         p1
1       0 0.8181818 0.18181818
2       0 0.7272727 0.27272727

Will reopen this issue