thomasp85 / lime

Local Interpretable Model-Agnostic Explanations (R port of original Python package)
https://lime.data-imaginist.com/

Error in explain function with H2O GBM regression model - Error in if (r2 > max) { : missing value where TRUE/FALSE needed #47

Closed andresrcs closed 7 years ago

andresrcs commented 7 years ago

Hi, can you please look into issue #46 again? Out of curiosity I tried dropping the month.lbl variable; I no longer get the warning message, but I still get the same error message even though my training data covers the full feature space.

library(tidyverse)
library(h2o)
library(lime)

dataset_url <- "https://www.dropbox.com/s/t3o1zvzq0t7emz4/sales.RDS?raw=1"
sales_aug <- readRDS(gzcon(url(dataset_url)))

sales_aug <- sales_aug %>% select(-month.lbl) # Drop the factor variable that doesn't cover its full feature range

train <- sales_aug %>% filter(month <= 8)
valid <- sales_aug %>% filter(month == 9)
test <- sales_aug %>% filter(month >= 10)

h2o.init()
h2o.no_progress()
train <- as.h2o(train)
valid <- as.h2o(valid)
test <- as.h2o(test)

y <- "amount"
x <- setdiff(names(train), y)

leaderboard <- h2o.automl(
  x, y,
  training_frame = train,
  validation_frame = valid,
  leaderboard_frame = test,
  max_runtime_secs = 30,
  stopping_metric = "MSE",
  seed = 12345
)
gbm_model <- leaderboard@leader

explainer <- lime(as.data.frame(train), gbm_model, bin_continuous = FALSE)
explanation <- explain(as.data.frame(test[1:5,]), explainer, n_features = 5)
#> Error in if (r2 > max) {: missing value where TRUE/FALSE needed
thomasp85 commented 7 years ago

Can I get you to try it with the latest version of lime from GitHub?

andresrcs commented 7 years ago

The previous error message is gone, but now there is a new one:

explanation <- explain(as.data.frame(test[1:5,]), explainer, n_features = 5)
#> Error in glmnet(x[, c(features, j), drop = FALSE], y, weights = weights,  : x should be a matrix with 2 or more columns
thomasp85 commented 7 years ago

Ok, so the reason for that error is quite specific to your dataset. You have a single column (index.num) whose range is so extreme that, because you do not bin continuous variables, it completely dominates the dataset when the similarity of the permutations is calculated. As a result, essentially all permutations get a weight of 0, which leads to errors in the model fit.

Based on the name and the values I would throw that column out unless you have very good reasons to keep it. If you really need it, then either play with the kernel_width parameter or use bin_continuous = TRUE (the latter will give more interpretable explanations anyway).
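For reference, a minimal sketch of the two suggested fixes, assuming the `train`, `test`, and `gbm_model` objects from the code above (untested; the exact kernel parameter name may differ across lime versions):

```r
library(lime)

# Option 1: drop the dominating column before building the explainer,
# so no single feature swamps the similarity calculation
train_df <- as.data.frame(train)
train_df$index.num <- NULL

# Option 2: bin continuous variables so extreme ranges are discretized
# (and, if needed, experiment with the kernel width in explain())
explainer <- lime(train_df, gbm_model, bin_continuous = TRUE)
explanation <- explain(as.data.frame(test[1:5, ]), explainer, n_features = 5)
```

With binning enabled, each continuous feature is reduced to a small set of bins, so the similarity kernel is no longer dominated by the raw scale of index.num and the permutation weights stay nonzero.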

thomasp85 commented 7 years ago

I've added a meaningful error message for cases like yours, where the similarity of the permutations to the original observation is zero and a local model cannot be created.