LIME vignette misleading

PabloRMira commented 6 years ago

Hi Thomas,

First of all, thank you very much for this great package!

I wanted to point out that the explanation of the kernel_width option in your vignette (https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html) is misleading: the difference between the two results obtained by adjusting kernel_width is due only to randomness. As far as I can tell, this is because with the default dist_fun ("gower"), neither the kernel nor kernel_width plays any role in the computations in the explain.data.frame function (https://github.com/thomasp85/lime/blob/master/R/dataframe.R).

To make my point clear, I reran the code from your vignette with a seed set before each of the two explanations is computed (see the example below). As you can see, the difference in model_r2 disappears, which corroborates my supposition.

Best regards, Pablo

# Reproducible example

library(MASS)
library(lime)
data(biopsy)

# First we'll clean up the data a bit
biopsy$ID <- NULL
biopsy <- na.omit(biopsy)
names(biopsy) <- c('clump thickness', 'uniformity of cell size', 
                   'uniformity of cell shape', 'marginal adhesion',
                   'single epithelial cell size', 'bare nuclei', 
                   'bland chromatin', 'normal nucleoli', 'mitoses',
                   'class')

# Now we'll fit a linear discriminant model on all but 4 cases
set.seed(4)
test_set <- sample(seq_len(nrow(biopsy)), 4)
prediction <- biopsy$class
biopsy$class <- NULL
model <- lda(biopsy[-test_set, ], prediction[-test_set])

set.seed(123) # <------ NEW!
explainer <- lime(biopsy[-test_set,], model, bin_continuous = TRUE, quantile_bins = FALSE)
explanation <- explain(biopsy[test_set, ], explainer, n_labels = 1, n_features = 4)
# Only showing part of output for better printing
explanation[, 2:9]
#>    case     label label_prob  model_r2 model_intercept model_prediction
#> 1   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 2   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 3   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 4   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 5     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 6     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 7     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 8     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 9   207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 10  207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 11  207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 12  207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 13  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#> 14  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#> 15  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#> 16  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#>                    feature feature_value
#> 1          normal nucleoli             5
#> 2              bare nuclei             3
#> 3          clump thickness             3
#> 4  uniformity of cell size             3
#> 5                  mitoses             1
#> 6              bare nuclei            10
#> 7          clump thickness             1
#> 8  uniformity of cell size             1
#> 9                  mitoses             1
#> 10         clump thickness            10
#> 11 uniformity of cell size            10
#> 12         bland chromatin             3
#> 13                 mitoses             1
#> 14             bare nuclei             1
#> 15         clump thickness             3
#> 16 uniformity of cell size             1

set.seed(123) # <------ NEW!
explanation <- explain(biopsy[test_set, ], explainer, n_labels = 1, n_features = 4, kernel_width = 0.5)
explanation[, 2:9]
#>    case     label label_prob  model_r2 model_intercept model_prediction
#> 1   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 2   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 3   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 4   416    benign  0.9943635 0.5473753       0.1177513        1.0081102
#> 5     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 6     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 7     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 8     7    benign  0.9527375 0.6524831       0.6516834        0.3432129
#> 9   207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 10  207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 11  207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 12  207 malignant  0.9999854 0.1701703       0.2978744        0.7120728
#> 13  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#> 14  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#> 15  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#> 16  195    benign  0.9999977 0.5529188       0.1288413        1.0438876
#>                    feature feature_value
#> 1          normal nucleoli             5
#> 2              bare nuclei             3
#> 3          clump thickness             3
#> 4  uniformity of cell size             3
#> 5                  mitoses             1
#> 6              bare nuclei            10
#> 7          clump thickness             1
#> 8  uniformity of cell size             1
#> 9                  mitoses             1
#> 10         clump thickness            10
#> 11 uniformity of cell size            10
#> 12         bland chromatin             3
#> 13                 mitoses             1
#> 14             bare nuclei             1
#> 15         clump thickness             3
#> 16 uniformity of cell size             1

Created on 2018-09-10 by the reprex package (v0.2.0).

scworland commented 6 years ago

The documentation indicates this on page 5:

"kernel_width: The width of the exponential kernel that will be used to convert the distance to a similarity in case dist_fun != 'gower'."

But you are correct that the vignette is misleading for the provided example, as it suggests that the default distance function is Euclidean while the documentation says it is Gower.
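
To illustrate the documented behaviour, here is a minimal base-R sketch of how a kernel width can end up unused on a Gower branch. It is not lime's actual source: the helpers exp_kernel, gower_like_dist and similarity, the default width, and the simplified numeric-only Gower distance are all assumptions made for illustration.

```r
# Hypothetical sketch, NOT lime's implementation: all names are illustrative.

# Exponential kernel: maps a distance to a similarity in (0, 1].
exp_kernel <- function(width) {
  function(d) sqrt(exp(-(d^2) / (width^2)))
}

# Simplified Gower-style distance for numeric features only:
# mean of range-scaled absolute differences, so it lies in [0, 1].
gower_like_dist <- function(case, perms) {
  ranges <- apply(perms, 2, function(col) diff(range(col)))
  ranges[ranges == 0] <- 1  # avoid dividing by zero for constant columns
  abs_diff <- abs(sweep(perms, 2, case, "-"))
  rowMeans(sweep(abs_diff, 2, ranges, "/"))
}

# Similarity of each permuted row to the explained case.
similarity <- function(case, perms, dist_fun = "gower", kernel_width = NULL) {
  # Arbitrary default width, chosen only for this sketch.
  if (is.null(kernel_width)) kernel_width <- sqrt(ncol(perms)) * 0.75
  if (dist_fun == "gower") {
    # kernel_width is never consulted on this branch.
    1 - gower_like_dist(case, perms)
  } else {
    # Only Euclidean distance is handled on this branch of the sketch.
    d <- sqrt(rowSums(sweep(perms, 2, case, "-")^2))
    exp_kernel(kernel_width)(d)
  }
}

set.seed(123)
perms <- matrix(runif(40), ncol = 4)  # 10 permuted rows, 4 features
case <- perms[1, ]                    # the first row plays the explained case

gower_default <- similarity(case, perms)
gower_narrow  <- similarity(case, perms, kernel_width = 0.5)
eucl_default  <- similarity(case, perms, dist_fun = "euclidean")
eucl_narrow   <- similarity(case, perms, dist_fun = "euclidean", kernel_width = 0.5)

identical(gower_default, gower_narrow)  # TRUE: the width has no effect
identical(eucl_default, eucl_narrow)    # FALSE: the width changes the weights
```

Going by the documentation quoted above, kernel_width should only start to matter in the reprex once a non-Gower dist_fun is passed to explain() together with it.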

thomasp85 commented 6 years ago

Thanks for pointing it out. This is an oversight after switching to gower.

thomasp85 / lime

LIME vignette misleading #121