thomasp85 / lime

Local Interpretable Model-Agnostic Explanations (R port of original Python package)
https://lime.data-imaginist.com/

permute_cases: Error arguments imply differing number of rows: 30000, 0 #173

Open agilebean opened 4 years ago

agilebean commented 4 years ago

Dear lime contributors, thanks for your awesome work on this repository. Alas, I ran into a reproducible error that took me several days to track down:

explanation.lime <- lime::explain(
  x = local.obs,
  explainer = explainer.lime,
  n_features = 5 
)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 30000, 0

Fortunately, I was able to narrow down both the location in the source code and the conditions that trigger the error, though not completely, so I hope you can figure out the last mile.

The trigger is a column in the cases argument of permute_cases that is an integer with zero variance; in my case, it is the column reviews.numHelpful:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   6 obs. of  13 variables:
 $ reviews.doRecommend: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2
 $ reviews.numHelpful : int  0 0 0 0 0 0
 $ reviews.rating     : int  4 4 4 5 5 5
 $ anger              : num  0 0 0 0 0 0
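
To check whether other columns meet the same condition, a minimal sketch like the following can flag constant numeric columns before calling lime::explain(). The toy data frame local_obs and the helper name is_constant_numeric are assumptions for illustration, mimicking the str() output above; only base R is used.

```r
# Toy data shaped like the str() output above (values are illustrative).
local_obs <- data.frame(
  reviews.doRecommend = factor(rep("TRUE", 6), levels = c("FALSE", "TRUE")),
  reviews.numHelpful  = rep(0L, 6),
  reviews.rating      = c(4L, 4L, 4L, 5L, 5L, 5L),
  anger               = rep(0, 6)
)

# TRUE for numeric (integer or double) columns whose values are all identical.
is_constant_numeric <- vapply(
  local_obs,
  function(col) is.numeric(col) && length(unique(col)) == 1L,
  logical(1)
)

constant_cols <- names(local_obs)[is_constant_numeric]
print(constant_cols)
```

In this toy data both reviews.numHelpful and anger are flagged, since both are constant across the six rows.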

This column leads to an empty output within the permute_cases.data.frame function, in the else if branch that computes bin:

} else if (is.numeric(cases[[i]]) && bin_continuous) {
      bin <- sample(seq_along(feature_distribution[[i]]), nrows, TRUE, as.numeric(feature_distribution[[i]]))
      diff(bin_cuts[[i]])[bin] * runif(nrows) + bin_cuts[[i]][bin]
    }

The empty second element can be seen in the resulting list:

$ : Factor w/ 2 levels "1","2": 1 2 1 2 2 2 2 1 2 1 ...
$ : int(0)
$ : int [1:30000] 14 5 5 19 31 10 27 7 10 10 ...
$ : num [1:30000] 0.021654 0.081145 0.039533 0.000972 0.029057 ...

I stepped through the conversion to a data frame and found that this line throws the above error:

perms <- as.data.frame(perms, stringsAsFactors = FALSE)
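
The error itself can be reproduced in isolation with base R. The sketch below uses a small hypothetical list (three rows instead of 30000, and made-up element names a, b, c) in which one element is empty, as in the output above.

```r
# A list with one zero-length element, like the int(0) entry above.
perms <- list(
  a = factor(c("1", "2", "1")),
  b = integer(0),
  c = c(14L, 5L, 5L)
)

# as.data.frame() cannot reconcile the lengths 3 and 0, so it stops with an error.
result <- tryCatch(
  as.data.frame(perms, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(result)

# lengths() shows which element is the culprit.
lens <- lengths(perms)
empty_elements <- names(lens)[lens == 0]
print(empty_elements)
```

Checking lengths(perms) just before the conversion is a quick way to see which column came back empty.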

The feature_distribution[[2]] gives:

     FALSE       TRUE 
0.04648887 0.95351113 

This is wrong! This result belongs to the only factor, i.e. the first column, and so should be returned by feature_distribution[[1]]! Consequently, the next line, diff(bin_cuts[[2]])[bin], always returns NULL, which leads to an empty return value, integer(0).

So far, I could narrow the root cause down to this point, but I do not understand what diff(bin_cuts[[2]])[bin] means or how this can be prevented.
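
As far as the quoted lines can be read, the expression draws a bin for each permuted row and then a uniform value inside that bin: diff(bin_cuts)[bin] is the width of the chosen bin and bin_cuts[bin] its lower edge. The sketch below replays this with hypothetical cut points and probabilities (bin_cuts and bin_prob are made-up values, not taken from the explainer), and then shows what happens if the cuts are NULL, which is an assumption about what a mismatched index might return.

```r
set.seed(1)
nrows <- 5
bin_cuts <- c(0, 1, 2, 4)      # hypothetical bin edges, giving 3 bins
bin_prob <- c(0.7, 0.2, 0.1)   # hypothetical probability of each bin

# Step 1: draw a bin index for every permuted row.
bin <- sample(seq_along(bin_prob), nrows, TRUE, bin_prob)

# Step 2: bin width * U(0, 1) + lower bin edge, i.e. a uniform draw within the bin.
values <- diff(bin_cuts)[bin] * runif(nrows) + bin_cuts[bin]
print(values)

# Assumed failure mode: if the cuts for this column are NULL,
# every term has length zero and so does the result.
null_cuts <- NULL
degenerate <- diff(null_cuts)[bin] * runif(nrows) + null_cuts[bin]
print(length(degenerate))
```

A zero-length result for one column is exactly what makes the later as.data.frame() call fail.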

Update

I found a potential reason for this apparent index problem. The feature distribution includes the target variable .outcome as its first list item, so all indices are off by one:

$.outcome
        1         2 
0.3277057 0.6722943 

$reviews.doRecommend
     FALSE       TRUE 
0.04648887 0.95351113

$reviews.numHelpful
           1            2            3            4 
0.9981241334 0.0012233912 0.0001631188 0.0004893565 

$anger
          1           2           3           4 
0.911100237 0.065900008 0.013620422 0.009379333
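
The off-by-one effect can be illustrated with base R alone. In the sketch below, feature_distribution is a shortened, rounded stand-in for the list above and case_cols is a hypothetical vector of the column names in the cases being explained; neither is taken from the lime internals.

```r
# Shortened stand-in for the list above (values rounded for illustration).
feature_distribution <- list(
  .outcome            = c(`1` = 0.33, `2` = 0.67),
  reviews.doRecommend = c(`FALSE` = 0.05, `TRUE` = 0.95),
  reviews.numHelpful  = c(`1` = 0.998, `2` = 0.001, `3` = 0.0005, `4` = 0.0005)
)

# Columns of the cases to explain: no target variable here.
case_cols <- c("reviews.doRecommend", "reviews.numHelpful")

# Positional lookup pairs case column i with list entry i, which is shifted by one.
by_position <- names(feature_distribution)[seq_along(case_cols)]
print(by_position)

# Lookup by name is unaffected by the extra leading entry.
by_name <- names(feature_distribution[case_cols])
print(by_name)
```

With positional lookup, reviews.numHelpful is matched to the distribution of reviews.doRecommend, which agrees with the feature_distribution[[2]] output shown earlier.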

However, including the target variable seems unavoidable, because the documentation for ?lime specifies:

x The training data used for training the model that should be explained.

So the training data (including the target), not the features (excluding the target), must be fed into lime::lime(). Now I wonder:

_Is this a problem in lime::lime() or permute_cases()?_

Can you fix this?? Tricky...