Great question. I think there are advantages and disadvantages to both, and I ended up choosing the first. For rationale and discussion, see #63 and #59, but note that we weight the data after generating the neighbors, so this discussion is mostly about how efficient we are in sampling. Even if we sample a bunch of points far from our current point, they will get weighted down by the distance function / kernel.
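To make the weighting step concrete, here is a minimal sketch of the idea, assuming the exponential kernel LIME uses by default (the exact kernel width is a constructor parameter; the values below are purely illustrative):

```python
import numpy as np

def exponential_kernel(distances, kernel_width=1.0):
    # Samples far from the explained instance receive weights near zero.
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))

# Distances (in scaled feature space) from perturbed samples to the instance.
distances = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
print(exponential_kernel(distances))
# approx. [0.88, 0.61, 0.14, 3.7e-06, 1.9e-22]
```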
Thank you for the rationale links. As you might have suspected already, the problem I am having has to do with outliers.
By sampling around the mean, most of the neighbors generated for an outlier end up so far away that they are weighted down to almost nothing, which squashes the effects of all features (as opposed to what I'd expect: only the locally irrelevant features showing no significant weight). This is exactly what I've been seeing in my data set, which has tons of "normal" samples and very few abnormal ones (the ones I'd like to detect). That gives a very tight standard deviation around the mean, with close to zero chance of generating samples in the target instance's neighborhood.
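To make the squashing concrete, here is a toy illustration with one standardized feature and an instance 6 standard deviations from the mean, using the same exponential kernel as above (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

instance = 6.0  # the outlier we want to explain (6 std devs from the mean)

# Neighbors drawn around the training mean, as in the current sampling scheme.
neighbors = rng.normal(loc=0.0, scale=1.0, size=5000)

distances = np.abs(neighbors - instance)
weights = np.sqrt(np.exp(-(distances ** 2) / 1.0 ** 2))  # kernel_width = 1.0

print(weights.max())   # only a few percent, even for the closest neighbor
print(weights.mean())  # on the order of 1e-4: the local model sees almost no signal
```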
A thought on your final answer in #63, by the way; feel free to correct me if I'm wrong. In the imaginary situation described there, with a really extreme value (age > 98) that is way outside the classifier's thresholds, isn't "the age doesn't matter" actually the right answer locally for that sample?
Given that there has been some discussion about this and we cannot conclude that one method is better than the other, would you be open to adding a sampling kwarg to `explain_instance` that lets the programmer choose the sampling method?
> A thought on your final answer in #63, by the way; feel free to correct me if I'm wrong. In the imaginary situation described there, with a really extreme value (age > 98) that is way outside the classifier's thresholds, isn't "the age doesn't matter" actually the right answer locally for that sample?
That is a good point. Of course, locality doesn't depend on a single feature, but it's still a good point.
> Given that there has been some discussion about this and we cannot conclude that one method is better than the other, would you be open to adding a sampling kwarg to `explain_instance` that lets the programmer choose the sampling method?
Definitely, we should add that option to LimeTabularExplainer. Do you want to do a pull request? :)
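For concreteness, here is a sketch of what such an option might look like; the parameter name, its placement, and the surrounding code are purely hypothetical, not the actual lime API:

```python
import numpy as np

# Hypothetical sketch only -- names and structure are illustrative.
class LimeTabularExplainer:
    def __init__(self, training_data, sample_around_instance=False, **kwargs):
        # When True, perturbations are centered on the instance being
        # explained; when False, on the training-data mean (current behaviour).
        self.sample_around_instance = sample_around_instance
        ...

    def __data_inverse(self, data_row, num_samples):
        data = np.random.normal(0, 1, size=(num_samples, data_row.shape[0]))
        if self.sample_around_instance:
            data = data * self.scaler.scale_ + data_row
        else:
            data = data * self.scaler.scale_ + self.scaler.mean_
        ...
```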
Unfortunately, I cannot contribute to open source according to my contract; or rather, I can, but the result would be owned by the company. So, as much as I would like to do a pull request, it would just make things complicated. :-P
Should I involve a third party to do it for me or is it easier for you to fix it yourself?
Wonderful. Thank you!
I was looking at the code in `__data_inverse` in `lime_tabular`. Why are the samples taken around the mean instead of around the sample itself? I might have misunderstood, but isn't this sampling technique causing outlier samples to get distant neighborhoods?
If I'm right about this, I would suggest using the sample as the center for the sampling instead, i.e.

`data = data * self.scaler.scale_ + data_row`

instead of

`data = data * self.scaler.scale_ + self.scaler.mean_`
In my mind, this would "guarantee" a neighborhood closely located to the sample regardless of its characteristics... which I suppose is what we're after?
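A small standalone illustration of the difference between the two centerings; `mean_`, `scale_`, and `data_row` below stand in for the fitted `StandardScaler` attributes and the explained instance:

```python
import numpy as np

rng = np.random.default_rng(0)

mean_ = np.array([0.0])      # stand-in for self.scaler.mean_
scale_ = np.array([1.0])     # stand-in for self.scaler.scale_
data_row = np.array([6.0])   # an outlier instance being explained

data = rng.normal(0, 1, size=(1000, 1))

around_mean = data * scale_ + mean_          # current behaviour
around_instance = data * scale_ + data_row   # suggested behaviour

print(np.abs(around_mean - data_row).mean())      # ~6: neighbors land far from the instance
print(np.abs(around_instance - data_row).mean())  # ~0.8: neighbors stay local
```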