thomasp85 / lime

Local Interpretable Model-Agnostic Explanations (R port of original Python package)
https://lime.data-imaginist.com/
483 stars 110 forks

Slow with thousands of features #66

Open johanneswaage opened 6 years ago

johanneswaage commented 6 years ago

Thanks for porting lime to R!

I'm trying to explain a large xgboost model with thousands of predictors from a TF-IDF matrix. Creating an "explainer" is fast, but explaining single observations using lime::explain takes hours, making it infeasible for production use. Is this a side effect of the implementation?

(Unfortunately, I can't provide a reprex).

Thanks, J

thomasp85 commented 6 years ago

Does it take hours to explain a single observation or all 1000? If it’s the latter then that is expected. If it’s the former there is a potential bug...

pommedeterresautee commented 6 years ago

How many features are you requesting in your explanation? (I have used it with millions of columns because of n-grams, and it runs quite quickly.) If you have long texts and ask for many variables, it's expected that the complexity explodes because of the large number of possibilities to explore.

johanneswaage commented 6 years ago

I'm using between 2000 and 5000 features: a few hundred as one-hot categoricals and the rest as numeric TF-IDFs. The matrix is very sparse. I'm explaining one observation, one class and 5 features. Here's the runtime for a sample set with 500 rows trained with an xgboost model.

[screenshot of explain() runtimes omitted]

Debugging, the primary time sink is this line in R/dataframe.R:

dist <- c(0, dist(feature_scale(perms, explainer$feature_distribution,
                                explainer$feature_type, explainer$bin_continuous),
                  method = dist_fun)[seq_len(n_permutations - 1)])

Thanks, Johannes

pommedeterresautee commented 6 years ago

The file R/dataframe.R should not be reached at all! Can you show your source code? It seems to me that you are not providing the data as a character vector.

johanneswaage commented 6 years ago

I'm not accessing dataframe.R directly; I was just pointing out the code step that takes the lion's share of the time. I train my model on preprocessed and tokenized data, so I'm not sure why I would provide the data as a character vector when I have a numeric matrix?

pommedeterresautee commented 6 years ago

OK, I now understand what is happening. There are two S3 functions that perform the work, so depending on how you provide the data you end up in one or the other. It's up to lime to generate the features it will use for the explanation. You are providing a matrix, so S3 dispatch routes you to the data.frame method, and you are not using the text explanation. Please follow the vignette; there is a part dedicated to text data.
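The dispatch behaviour described above can be illustrated with a small base-R sketch (the generic and method names here are made up for illustration, not lime's actual internals): an S3 generic picks its method from the class of the first argument, so a character vector and a data frame end up on different code paths.

```r
# Hypothetical S3 generic mirroring how input class selects a code path
explain_route <- function(x, ...) UseMethod("explain_route")

# A character vector would be routed to the text-permutation path
explain_route.character <- function(x, ...) "text path"

# A data frame (or tibble) is routed to the tabular path,
# which computes pairwise distances over every column
explain_route.data.frame <- function(x, ...) "tabular path"

explain_route("cheap flights to paris")   # text path
explain_route(data.frame(tfidf_1 = 0.3))  # tabular path
```

This is why passing a TF-IDF matrix or data frame, rather than the raw documents as a character vector, lands you in the slow tabular code in R/dataframe.R.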

johanneswaage commented 6 years ago

Ah, OK, I see what you mean. In my dataset matrix (or data frame), I have a mix of one-hot encoded categoricals and the TF-IDF features. I guess this is a pretty common use case. Would this mean that I would have to get separate explanations for the categoricals (using the data frame method) and the text (using the character method)?

pommedeterresautee commented 6 years ago

Nope. Transform your categories into words (like cat_X, cat_Y); then the whole thing becomes one matrix in the end.
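A minimal base-R sketch of that suggestion (the column and token names are invented for illustration): encode each categorical value as a pseudo-word and append it to the document text, so a single character vector carries both kinds of features.

```r
docs  <- c("cheap flights to paris", "hotel booking refund")
cat_1 <- c("a", "b")
cat_2 <- c("x", "x")

# Turn each categorical value into a unique token, e.g. "cat_1_a",
# and append the tokens to the raw text of each document
docs_with_cats <- paste(docs,
                        paste0("cat_1_", cat_1),
                        paste0("cat_2_", cat_2))

docs_with_cats[1]
# "cheap flights to paris cat_1_a cat_2_x"
```

The tokenizer and TF-IDF step then treat cat_1_a like any other word, so the character-vector path can explain the text and the categorical signal together.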

johanneswaage commented 6 years ago

Dear both, here's a reprex that closely mimics my problem. The dataset is a combination of character words transformed to TF-IDFs plus categorical metadata. I'm keen to hear how this data is best run through lime.

library(tidyverse)
library(caret)
library(xgboost)

no_observations <- 500
no_tf_idf_features <- 1000

# Create simulated response
response <- sample(letters[1:5], no_observations, replace = TRUE)

# Create simulated TF_IDF matrix
tf_idf <- rnorm(no_observations * no_tf_idf_features) %>% matrix(ncol = no_tf_idf_features)
colnames(tf_idf) <- paste0("tfidf_", seq_len(no_tf_idf_features))
tf_idf_tbl <- tf_idf %>% as_tibble()

# Create simulated categorical
cat_tbl <- tibble(
  cat_1 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_2 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_3 = sample(letters[1:10], no_observations, replace = TRUE)
)

# Merge together and one-hot-encode
train_data <- bind_cols(cat_tbl,
                        tf_idf_tbl)

train_data_m <- Matrix::sparse.model.matrix(~.-1, data = train_data)

# Train using caret
cv_5 <- trainControl(method = "cv", 
                     number = 5, 
                     allowParallel = FALSE)

single_grid <- expand.grid(
  nrounds = 25,
  max_depth = 5,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.5,
  min_child_weight = 1,
  subsample = 1
)

xgb_no_tune <- caret::train(x=train_data_m,
                     y=response,
                     method="xgbTree",
                     metric="Accuracy",
                     trControl = cv_5, 
                     tuneGrid = single_grid,
                     nthread = 1)

# Explain using lime
explainer <- lime::lime(train_data, xgb_no_tune, bin_continuous = TRUE)
explanation <- lime::explain(x = train_data[1,], explainer = explainer, labels = "a", n_features = 5)

Obviously, you would probably explain on a test set, but nonetheless...

Thanks for your time and help /J

thomasp85 commented 6 years ago

Unfortunately, lime is really not able to handle this kind of mixed model type very well at the moment... I'll try to think about it a bit, but it is a hard problem to solve.

jdegange commented 6 years ago

I'm having issues trying to run this library in a Spark environment. Curious whether anyone has found a workaround for this, or more generally for running it in a parallel framework?

platinum736 commented 5 years ago

Hi all, I have encountered similar slowness issues while running explain on a particular instance; it takes more than a minute per instance. I have 1800 features in my model, which are a combination of BoW and numeric features.

Please let me know if this is expected behaviour.

Thanks.

thomasp85 commented 5 years ago

@platinum736 can I get you to open a new issue... this issue is not concerned with running lime on Spark

platinum736 commented 5 years ago

Hi @thomasp85 ,

I am not running lime on Spark; this is on a plain pandas dataframe.

thomasp85 commented 5 years ago

Then you should post your question on the Python repository :-)