It seems that LIME explanations do not work for variables that contain only NAs and a single constant value, even though XGBoost fits them without trouble.
For instance, I have a variable that is highly correlated with the target; in fact, it has the highest gain in the variable importance. Moreover, if we replace its missing values with an extreme value, its correlation with the target is 0.77.
However, it does not work in the LIME explanation because its standard deviation is zero (LIME does not take missing values into account, unlike XGBoost). As a result, I can't use LIME with this type of variable. Is there any solution other than removing such columns, which seem to work well in XGBoost?
Here is a simple example of the problem. Thanks in advance.
library(dplyr)
library(xgboost)
library(lime)

df <- data.frame(
  target = c(0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2),
  var1 = rnorm(22),
  var2 = rnorm(22)*10,
  var3 = c(rep(0,20),1,1),
  var4 = c(-1,-2,5,3,1,2,2,1,1,2,1,-1,5,1,1,20,2,1,0,2,2,2),
  var5 = c(NA,NA,NA,NA,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
)
Train XGBoost
X_train <- df %>% select(-target)
dtrain <- xgb.DMatrix(data.matrix(X_train), label = as.matrix(df$target))
boost <- xgb.train(
  data = dtrain,
  params = list(max_depth = 7, eta = 0.1, objective = "multi:softprob", eval_metric = "error", nthread = 1),
  num_class = 3,
  nrounds = 100
)
xgb.importance(feature_names = colnames(dtrain), model = boost)
local_obs <- X_train[c(1,2),]
Fit LIME, quantile_bins = FALSE
explainer1 <- lime(x = X_train, model = boost, quantile_bins = FALSE)
Error in cut.default(x[[i]], unique(explainer$bin_cuts[[i]]), labels = FALSE, :
invalid number of intervals
explanations1 <- lime::explain(local_obs, explainer1, n_labels = 2, n_features = 2)
plot_explanations(explanations1)
Fit LIME, quantile_bins = TRUE
explainer2 <- lime(x = X_train, model = boost, quantile_bins = TRUE)
Error in cut.default(x[[i]], unique(explainer$bin_cuts[[i]]), labels = FALSE, :
invalid number of intervals
In addition: Warning messages:
1: var3 does not contain enough variance to use quantile binning. Using standard binning instead.
2: var5 does not contain enough variance to use quantile binning. Using standard binning instead.
explanations2 <- lime::explain(local_obs, explainer2, n_labels = 2, n_features = 2)
plot_explanations(explanations2)
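The zero-deviation point can be checked without LIME or XGBoost. The sketch below is base R only and reuses the var3, var5 and target values from the example. The helper `constant_ignoring_na` and the sentinel value -999 are hypothetical, introduced purely for illustration. It flags columns whose non-missing values are all identical, then encodes NA as a sentinel so the column has two distinct values. Whether this is enough for `lime()` to bin the column has not been verified here, and the model would presumably have to be retrained on the filled data so that the explainer and the model see the same encoding.

```r
# Same var3 / var5 / target values as in the example above (base R only).
target <- c(rep(0, 4), rep(1, 11), rep(2, 7))
X <- data.frame(
  var3 = c(rep(0, 20), 1, 1),
  var5 = c(rep(NA, 4), rep(1, 18))
)

# Hypothetical helper: TRUE when all non-missing values of x are identical.
constant_ignoring_na <- function(x) length(unique(x[!is.na(x)])) <= 1

flagged <- names(X)[vapply(X, constant_ignoring_na, logical(1))]
print(flagged)  # only var5 is constant once NAs are ignored

# Assumed workaround: encode NA as a sentinel outside the observed range,
# so the column carries two distinct values instead of one.
sentinel <- -999
X_filled <- X
for (col in flagged) {
  X_filled[[col]][is.na(X_filled[[col]])] <- sentinel
}

print(sd(X$var5, na.rm = TRUE))     # 0: no spread when NAs are dropped
print(sd(X_filled$var5))            # positive after filling
print(cor(X_filled$var5, target))   # about 0.77, the figure quoted above
```

The sentinel only needs to differ from the observed constant; any value clearly outside the real range of the data serves the same purpose, since the correlation depends only on which rows were missing.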