LIME missing explanations with quantile_bins=T for discrete numeric fields

thomasp85 / lime

Local Interpretable Model-Agnostic Explanations (R port of original Python package)

https://lime.data-imaginist.com/

LIME missing explanations with quantile_bins=T for discrete numeric fields #154

Status: Closed (vla6 closed this issue 5 years ago)

vla6 commented 5 years ago

I have noticed that when the input data contains discrete numeric fields, LIME leaves those fields out of its explanations. These could be binary fields passed to a model as numeric 0/1 values (some models take only numeric inputs), or variables that take only a few distinct values, for example because of rounding.

I am running lime via R. I am attaching a reproducible example.

The issue seems to occur when bin_continuous=T and quantile_bins=T are set in lime(). Setting quantile_bins=F appears to "fix" the issue. The relative proportions of the levels don't seem to matter much. I am wondering if there is possibly some issue with ties in quantile_bins. The exact values of the numeric levels don't seem to matter either, and the same issue occurs for 2 or 3 distinct levels in numeric fields. Using a model type that takes factor inputs also "fixes" the issue.

Please let me know if you have comments or need more information. Thank you!

lime_vignette_test.txt

thomasp85 commented 5 years ago

quantile binning a binary variable is a pretty weird thing to do... I don't know what you expected the outcome to be...

If you can come up with a reasonable behaviour, I'll be open to changing it.

vla6 commented 5 years ago

Thank you for the reply. It is a strange thing to do, but the behavior can occur with variables that aren't technically binary but take only a small, fixed set of values. For instance, counts of rare events, where maybe you only see 0, 1, 2 in your data set. Also, these variables can be very influential but are still ignored, which is confusing.

Is there a place where a binning error is thrown? It may be enough to print a warning or to document the data prep concern, but there may be other options. dplyr::ntile() puts observations in order, breaking ties by order of appearance, then assigns them to quantiles using that order, so you always get the number of bins you specify. That could be an option. However, it may be better to fall back on regular (non-quantile) binning when quantile binning fails, if that's possible.
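A rough sketch of the rank-based idea, written in Python with NumPy rather than R. It is similar in spirit to dplyr::ntile() but is not its exact implementation (bucket sizes can differ when the data does not divide evenly); the function name `rank_bins` and the sample counts are made up for illustration.

```python
import numpy as np

def rank_bins(values, n_bins):
    """Assign 0-based bin labels by rank, so ties cannot collapse bins.

    Hypothetical helper: rank-based binning in the spirit of ntile(),
    not a port of it.
    """
    values = np.asarray(values)
    # Stable sort: tied values keep their original relative order.
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(len(values))
    # Map ranks 0..n-1 onto bins 0..n_bins-1.
    return ranks * n_bins // len(values)

# Counts of a rare event: only 0, 1 and 2 occur.
counts = np.array([0, 0, 0, 1, 0, 2, 0, 1])
bins = rank_bins(counts, n_bins=4)
n_distinct_bins = len(set(bins.tolist()))
```

All four bins are populated here, but identical values (the zeros) end up in different bins, which is one reason a fallback to regular binning may be the cleaner option.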

thomasp85 commented 5 years ago

The "problem" is that with quantile binning and only two outcomes, the cuts are placed on either side of all possible permuted values. The permuted values then all fall into the same bin, have zero variance, and are discarded from further use...
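The effect can be reproduced outside lime. The following is a minimal NumPy sketch, not lime's actual R code: it assumes four bins, quantile cuts at evenly spaced probabilities with duplicate cuts dropped, and equal-width cuts as the non-quantile alternative. The helper names and the 70/30 sample are hypothetical.

```python
import numpy as np

def quantile_breaks(values, n_bins=4):
    # Cuts at evenly spaced probabilities; duplicated cuts are dropped.
    probs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.unique(np.quantile(values, probs))

def equal_width_breaks(values, n_bins=4):
    # Cuts evenly spaced between the minimum and the maximum.
    return np.linspace(np.min(values), np.max(values), n_bins + 1)

def assign_bins(values, breaks):
    # 0-based bin index; the maximum value goes into the last bin.
    idx = np.searchsorted(breaks, values, side="right") - 1
    return np.clip(idx, 0, len(breaks) - 2)

# A binary feature stored as numeric 0/1 (70% zeros, 30% ones).
binary = np.array([0.0] * 70 + [1.0] * 30)

q_breaks = quantile_breaks(binary)      # only the cuts 0 and 1 survive
w_breaks = equal_width_breaks(binary)   # 0, 0.25, 0.5, 0.75, 1

q_bins = assign_bins(binary, q_breaks)
w_bins = assign_bins(binary, w_breaks)

# Every value shares one quantile bin, so the binned feature is constant.
q_variance = float(np.var(q_bins))
w_variance = float(np.var(w_bins))
```

With the quantile cuts, both 0 and 1 land in the single bin between the two surviving cuts, so the binned feature has zero variance; with the equal-width cuts, the two values land in different bins.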

Falling back to standard binning may be a good idea..?
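One way such a fallback could look, as a hedged NumPy sketch rather than the change actually made in lime: the function `choose_breaks`, the rule "fall back whenever ties remove at least one quantile cut", and the use of equal-width cuts as "standard binning" are all assumptions made for illustration.

```python
import numpy as np

def choose_breaks(values, n_bins=4):
    """Return (breaks, method), preferring quantile cuts.

    Hypothetical rule: if ties leave fewer than n_bins + 1 distinct
    quantile cuts, use equal-width cuts over the data range instead.
    """
    values = np.asarray(values, dtype=float)
    probs = np.linspace(0.0, 1.0, n_bins + 1)
    breaks = np.unique(np.quantile(values, probs))
    if len(breaks) < n_bins + 1:
        # Quantile binning lost bins to ties: fall back.
        breaks = np.linspace(values.min(), values.max(), n_bins + 1)
        return breaks, "equal_width"
    return breaks, "quantile"

# Binary feature: quantile cuts collapse, so the fallback is used.
binary = np.array([0.0] * 70 + [1.0] * 30)
binary_breaks, binary_method = choose_breaks(binary)

# Feature with 100 distinct values: quantile cuts are kept.
spread = np.arange(100, dtype=float)
spread_breaks, spread_method = choose_breaks(spread)
```

A looser rule (for example, falling back only when a single bin remains) would keep quantile cuts in more cases; which threshold is appropriate is a design choice this sketch does not settle.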

vla6 commented 5 years ago

Yes it's a strange case. Thanks for the reply! I think falling back to standard binning would be good in that case if that is possible.

I see the behavior with 3 levels as well as with binary fields, but maybe that has to do with the distributions of the outcomes.

thomasp85 commented 5 years ago

I think so. A quick test binning bare nuclei into three bins keeps it in the explanation in my case.

vla6 commented 5 years ago

Thank you so much!