marcotcr / lime

Lime: Explaining the predictions of any machine learning classifier

LIME vs feature importance #180

Closed · germayneng closed this issue 5 years ago

germayneng commented 6 years ago

Hi,

I have a question regarding the feature importance vs LIME.

For the adult data set, we can see the feature importance of my model -

[feature importance plot]

However, when I plot various LIME explanations, I get a different picture. I will post a few:

[LIME explanation plots for three test points]

I ran around 20 plots, and in most of them we can see, for example, the variable marital status being used in the decision. However, in the feature importance plot its importance is relatively low. Is there a reason for this?

Feature importance tells us that the more important features are used for splitting higher up in the trees. LIME, on the other hand, is ordered by the weight values. Is it correct to understand that a more important feature does not necessarily result in a larger gain/loss in LIME?

bbennett36 commented 6 years ago

Total 'gain' for capital gain = 0.10

Total 'gain' for capital gain for class >50k = 0.05
Total 'gain' for capital gain for class <50k = 0.05

0.05 + 0.05 = 0.10

The first plot is the total.

The next 2 plots show the total for each class, which, added together, would give the first plot.

germayneng commented 6 years ago

@bbennett36 Are you saying that the feature importance (which is the first plot) is the average of all the gains from the LIME plots? Because the 3 LIME plots are only 3 random points from the test set.

Also, if the feature importance were the total, it would not make sense, because age is the highest there; yet across over 20 plots of 20 random points, the gain of age is not that high at all.

bbennett36 commented 6 years ago

No. I just noticed your plots are showing the contributions only for 'Class > 50k' (which I'm assuming is a classification problem). I'm going to guess that if you look at the gain for 'Class < 50k' and add up the gain for both classes, it will equal the totals that you're seeing in the first plot.

Does that make sense? It looks like you're only looking at the gain for one class and wondering why it's not equal to the total. You need to plot the other class and see if they add up.

germayneng commented 6 years ago

@bbennett36

The first plot is the overall plot of feature importance from the model itself.

The subsequent plots are LIME plots based on random points from the test set. As such, I do not understand the logic of adding up the LIME plots to explain the feature importance plot. Each prediction can be interpreted from LIME just by summing up the gains/losses to obtain the probability. Also, this is indeed a classification problem, identifying whether class > 50k.

Isn't feature importance based on how high a feature sits in the tree, so that the number represents the fraction of the input samples it helps to split (see: http://scikit-learn.org/stable/modules/ensemble.html#random-forest-feature-importance)? A higher fraction means more splits are based on this feature. It doesn't translate directly to LIME, which reports weights.
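For reference, reading those importances from a fitted scikit-learn forest looks roughly like the sketch below (`model` and `feature_names` are placeholders for my fitted RandomForestClassifier and the column names):

```python
import numpy as np

# Impurity-based importances, averaged over the trees and normalized to sum to 1.
importances = model.feature_importances_

# List features from most to least important, as in the first plot.
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```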

So, tl;dr: I believe you are interpreting the feature importance plot wrong.

What does not make sense to me is why the feature importance seems to contradict the LIME plots. If age is of higher importance, does this mean that in the local interpretation by LIME each prediction should give AGE a higher weight, or does this not hold true?

mizukasai commented 6 years ago

@germayneng Have you checked how well LIME is explaining your model? What's the approximation error?

germayneng commented 6 years ago

@mizukasai What is the function to obtain the approximation error for LIME?

mizukasai commented 6 years ago

@germayneng The lime.explanation.Explanation class has a score attribute. For example:

explainer = lime.lime_tabular.LimeTabularExplainer(train)
exp = explainer.explain_instance(sample, predict_fn)
exp.score
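Putting it together end to end, a sketch looks like this (`train`, `sample`, `feature_names`, and `model` are placeholders for your training array, the row to explain, the column names, and your fitted classifier):

```python
import lime.lime_tabular

# Build the explainer from the training data (2-D numpy array).
explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=feature_names)

# explain_instance needs the instance and a function returning class probabilities.
exp = explainer.explain_instance(sample, model.predict_proba)

# R^2 of the local Ridge fit: how well LIME approximates the model around `sample`.
print(exp.score)
```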

marcotcr commented 6 years ago

Adding up LIME explanations should not result in the feature importance weights. @bbennett36 is interpreting the feature importance graph incorrectly.

@germayneng You are correct: more important features according to feature importance in random forests are not necessarily going to show up with higher weights with LIME. Some features may have a lot of impact on individual predictions, but may be fragmented across the tree and thus get low feature importance. One quick thing you can do to check explanations is to test them: for the points you got explanations for, try perturbing capital gain, education and age and see the impact that those changes have.
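For example, a quick check along those lines might look like the sketch below (illustrative only; `model` is the fitted classifier, `row` is one of the test points that was explained, and the column indices are hypothetical placeholders for your encoding):

```python
import numpy as np

# Hypothetical column positions for the features being perturbed.
CAPITAL_GAIN, AGE = 10, 0

# Predicted probability of '>50K' for the original row.
baseline = model.predict_proba(row.reshape(1, -1))[0, 1]

# Change one feature at a time and see how much the prediction moves.
for col, new_value in [(CAPITAL_GAIN, 0.0), (AGE, row[AGE] + 10)]:
    perturbed = row.copy()
    perturbed[col] = new_value
    p = model.predict_proba(perturbed.reshape(1, -1))[0, 1]
    print(f"column {col}: {baseline:.3f} -> {p:.3f}")
```

If the features LIME ranks highly are also the ones that move the prediction the most, the explanations are holding up.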

germayneng commented 6 years ago

@mizukasai I have an exp.score of 0.38905211797144262. How do I interpret this?

@marcotcr Thank you for your reply! OK, I understand now. Basically, from another example here I can see that features with higher feature importance do not always result in the highest/lowest gain/loss when it comes to predicting the target. What a higher-importance feature does is perform the splitting at the topmost nodes.

Also, when you say perturbing, is it as in your tutorial examples, where you run through the various values of a particular variable to identify the impact on the target while keeping the other variables constant?

marcotcr commented 6 years ago

Yes, but you can also do it with multiple variables at a time. Basically that is how LIME is coming up with these weights in the first place.

germayneng commented 6 years ago

@marcotcr Can I say that the concept of LIME perturbation is similar to a partial dependence plot, except that it is localized?

Also, may I ask what the default ranges are for perturbing the variables? Do you have a standard practice?

mizukasai commented 6 years ago

@germayneng The score you see is the sklearn Ridge score for the perturbed data and the labels predicted by your model, i.e. an R² score. @marcotcr is there a score threshold below which we can no longer consider that LIME is approximating the model well locally?
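Conceptually, that score comes from a fit like the sketch below (not the actual library code, which lives in lime_base.py; `Z` stands for the perturbed samples, `y` for your model's predicted probabilities on them, and `w` for the proximity weights):

```python
from sklearn.linear_model import Ridge

# Fit the local linear surrogate on the perturbed neighbourhood,
# weighting samples by their proximity to the explained instance.
surrogate = Ridge(alpha=1.0)
surrogate.fit(Z, y, sample_weight=w)

# exp.score is this weighted R^2: how faithfully the linear surrogate
# reproduces the model's behaviour around the instance.
r2 = surrogate.score(Z, y, sample_weight=w)
```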

germayneng commented 6 years ago

@mizukasai That makes sense, since LIME uses ridge regression under the hood. But isn't R² a poor gauge, since we can add more variables and it will increase?

marcotcr commented 6 years ago

@germayneng I don't think it is similar to PDP: you don't see how the output changes as a function of the input for numerical features, you don't look at one feature at a time, etc. For perturbing: if the data is categorical or discretized, we sample from a multinomial with probabilities given by the distribution in the training data. For continuous (non-discretized) data we sample from a normal with mu and sigma estimated from the training data.
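In sketch form (illustrative only; the actual code in lime_tabular.py also handles scaling and discretization, and `train_column` / `n_samples` are placeholder names):

```python
import numpy as np

def sample_feature(train_column, is_categorical, n_samples=5000):
    """Draw perturbed values for a single feature, as described above."""
    if is_categorical:
        # Multinomial: reuse the value frequencies observed in the training data.
        values, counts = np.unique(train_column, return_counts=True)
        return np.random.choice(values, size=n_samples, p=counts / counts.sum())
    # Continuous, non-discretized: normal with mean and std estimated from the training data.
    return np.random.normal(train_column.mean(), train_column.std(), size=n_samples)
```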

@mizukasai I think that threshold is application dependent; I don't think I can come up with a threshold that works for everyone : )