interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License
6.04k stars 715 forks

Question about the visualization of the feature importance. #491

Open JWKKWJ123 opened 5 months ago

JWKKWJ123 commented 5 months ago

Hi all, when I used visualize() to output the global/local explanations, I found that it only shows up to 15 features (including pairwise terms), and I didn't find a parameter in visualize() that I can edit to change this. I would like to ask: is there a way to draw a plot that includes more than 15 features using interpretml? Thanks!

paulbkoch commented 5 months ago

It would be good to be able to configure this, but it's not something we currently support. You can however get the local explanation values and plot them however you like. To do this, use predict_and_contrib:

https://interpret.ml/docs/ExplainableBoostingClassifier.html#interpret.glassbox.ExplainableBoostingClassifier.predict_and_contrib
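
Something along these lines should work (just a sketch; it assumes a fitted ExplainableBoostingClassifier named ebm and a feature matrix X):

import numpy as np

# predict_and_contrib returns the predictions plus a (n_samples, n_terms)
# matrix of per-term contributions
preds, contribs = ebm.predict_and_contrib(X)

# local explanation for the first sample, sorted by absolute contribution
order = np.argsort(-np.abs(contribs[0]))
for idx in order[:20]:
    print(ebm.term_names_[idx], contribs[0][idx])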

JWKKWJ123 commented 5 months ago

It would be good to be able to configure this, but it's not something we currently support. You can however get the local explanation values and plot them however you like. To do this, use predict_and_contrib:

https://interpret.ml/docs/ExplainableBoostingClassifier.html#interpret.glassbox.ExplainableBoostingClassifier.predict_and_contrib

Hi Paul, thank you very much! As far as I know, interpretml uses plotly to draw figures. I am learning to use plotly this week (I used matplotlib before), and I can now draw a plot similar to interpretml's. However, I found the layout becomes chaotic when the number of features is >= 25. Is that the reason why interpretml only includes up to 15 features?

paulbkoch commented 5 months ago

What was the chaos in plotly? I think the main reason was because you get scroll bars in Jupyter Notebook if your cell is too big and, well, it's unreasonable to show thousands of terms if your model has that many, so they need to be clipped at some point. But I didn't write the UI, so @Harsha-Nori and @nopdive would know more.

JWKKWJ123 commented 5 months ago

What was the chaos in plotly? I think the main reason was because you get scroll bars in Jupyter Notebook if your cell is too big and, well, it's unreasonable to show thousands of terms if your model has that many, so they need to be clipped at some point. But I didn't write the UI, so @Harsha-Nori and @nopdive would know more.

Thanks for your reply! Now I am trying to train an EBM with 20-25 features (including pairwise features). The display problem is probably caused by my lack of familiarity with the plotly package; I think it can be solved.

paulbkoch commented 5 months ago

Hi @JWKKWJ123 - You can also kind of hack the UI to show you what you want by simply removing the first 15 terms from the model, which will then show you the next 15 terms if you visualize it afterwards. Just beware that the predicted value shown in the UI will no longer be valid since the model has been edited.

https://interpret.ml/docs/ExplainableBoostingClassifier.html#interpret.glassbox.ExplainableBoostingClassifier.remove_terms
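
For example (a sketch; it copies the model first so the original stays usable for predictions):

from copy import deepcopy
from interpret import show

# work on a copy so the original model keeps its valid predictions
ebm_tail = deepcopy(ebm)
ebm_tail.remove_terms(ebm_tail.term_names_[:15])  # drop the first 15 terms

# the global view now shows the next 15 terms
show(ebm_tail.explain_global())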

JWKKWJ123 commented 5 months ago

Hi @paulbkoch, thank you very much! I will try both using remove_terms(terms) and drawing the plot myself with plotly.

emrynHofmannElephant commented 4 months ago

Hi @JWKKWJ123 - have you made any progress on this? I've got a model with over 100 terms and would like to visualise the global importances (preferably via plotly). I know I can access the feature importance values via ebm.term_importances() and produce a DataFrame, which can then be easily plotted:

import pandas as pd

# assumes a fitted EBM named ebm
ebm_importances = pd.DataFrame(
    {
        "Feature Name": ebm.term_names_,
        "Importance (avg_weight)": ebm.term_importances(),
        "Importance (min_max)": ebm.term_importances(importance_type="min_max"),
    }
).sort_values(by="Importance (avg_weight)", ascending=False)
ebm_importances.plot(kind="barh", x="Feature Name", y="Importance (avg_weight)", figsize=(10, 10))

But was wondering if there's a nicer way (especially to have it be the same style as the local explanations).

paulbkoch commented 4 months ago

It's possible to get more than 15 terms displayed by changing the top_n value here:

https://github.com/interpretml/interpret/blob/b9d577e7edeb854c3d6bb958c17f41a340fe7717/python/interpret-core/interpret/glassbox/_ebm/_ebm.py#L146

But there's a catch. This change doesn't increase the vertical height of the iframe in which it resides, so if you make top_n too large, the terms will be scrunched together. It's possible to zoom into regions though by dragging the mouse over the area you're interested in.

JWKKWJ123 commented 4 months ago

Hi all, sorry for the late reply. I now need to show more than 15 features and I want to give different colors to different kinds of features, so I use the plotly package to draw the bar plot. For the global importance, I took the absolute value of all the local feature importances on the training set and averaged them across subjects; this gives the same result as ebm.term_importances(importance_type='avg_weight'). I used ebm.predict_and_contrib() because it can calculate the global/local feature importance on either the training or the test set, whereas ebm.term_importances() only gives the global feature importance on the training set. The plot still needs further polishing, but it works now:

import numpy as np
import pandas as pd
import plotly.graph_objects as go

### Calculate global feature importance
names = ebm.term_names_
# predict_and_contrib returns (predictions, per-sample term contributions)
_, contributions = ebm.predict_and_contrib(X_train)
# mean absolute contribution per term across subjects (= avg_weight importance)
grouplevel_featureimportance = np.average(np.abs(contributions), axis=0)

list_featureimportance = []
for (term, importance) in zip(names, grouplevel_featureimportance):
    list_featureimportance.append([term, importance])
    # print(f"Term {term} importance: {importance}")

importance_df = pd.DataFrame(list_featureimportance, columns=['Feature', 'Importance'])
importance_df = importance_df.sort_values('Importance', ascending=True)
# choose the top 20 features
importance_df = importance_df[-20:]

### Plot the feature importance
colors = 'blue'  # or a list with one color per bar, e.g. ['red', 'orange', 'blue', ...]
fig = go.Figure(go.Bar(x=importance_df["Importance"], y=importance_df["Feature"], orientation='h',
                       text=importance_df["Importance"].round(2),
                       name='Feature Importances of EBM', marker_color=colors))
fig.write_image('       .png')

JWKKWJ123 commented 4 months ago

It's possible to get more than 15 terms displayed by changing the top_n value here:

https://github.com/interpretml/interpret/blob/b9d577e7edeb854c3d6bb958c17f41a340fe7717/python/interpret-core/interpret/glassbox/_ebm/_ebm.py#L146

But there's a catch. This change doesn't increase the vertical height of the iframe in which it resides, so if you make top_n too large, the terms will be scrunched together. It's possible to zoom into regions though by dragging the mouse over the area you're interested in.

Dear Paul, I would like to ask: does "min_max" mean normalization? If it means normalization, should the order of features be the same between "min_max" and "avg_weight"? I now want to calculate the average global feature importance over the different folds of cross-validation. Since EBM is an additive model, is it reasonable to average the normalized importance of each feature across the cross-validation folds?

paulbkoch commented 4 months ago

"min_max" is something simpler. If you're looking at the graphs visually then "min_max" is the vertical difference between the highest point on the graph and the lowest point on the graph. The "avg_weight" importance of a feature/term can be calculated by looking up the contribution value on the feature/term's graph for each sample in the training set, then taking the absolute values, then averaging those. If you were to look at our code that does this calculation, you'd see that instead of iterating over all the samples, we use an equivalent method that leverages the bin weights that we preserve in the model. This has the advantage that we don't need access to the original training set to calculate it. Since "min_max" and "avg_weight" are different metrics, their orderings will not be identical. It is possible to imagine many other ways you might want to measure feature/term importances. Another one that we'll probably add at some point is the change in metrics like log loss when you remove each individual feature from the model. This will require adding an "X" parameter to the function since we'll then need access to a dataset. This would also add utility in terms of calculating importances on test sets, etc..

Averaging the feature/term importances across folds should work. If the folds have different sample weights, you might want to take the weighted average.
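
As a rough illustration (a sketch, not our exact code path; it assumes a binary classifier fitted on X_train):

import numpy as np

# per-sample, per-term contributions on the training set
_, contribs = ebm.predict_and_contrib(X_train)   # shape (n_samples, n_terms)

# "avg_weight": mean absolute contribution across the training samples
avg_weight = np.mean(np.abs(contribs), axis=0)

# "min_max": vertical span of each term's graph (no absolute value involved)
min_max = np.array([scores.max() - scores.min() for scores in ebm.term_scores_])

# avg_weight should line up with ebm.term_importances(), and min_max with
# ebm.term_importances(importance_type="min_max")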

JWKKWJ123 commented 4 months ago

"min_max" is something simpler. If you're looking at the graphs visually then "min_max" is the vertical difference between the highest point on the graph and the lowest point on the graph. The "avg_weight" importance of a feature/term can be calculated by looking up the contribution value on the feature/term's graph for each sample in the training set, then taking the absolute values, then averaging those. If you were to look at our code that does this calculation, you'd see that instead of iterating over all the samples, we use an equivalent method that leverages the bin weights that we preserve in the model. This has the advantage that we don't need access to the original training set to calculate it. Since "min_max" and "avg_weight" are different metrics, their orderings will not be identical. It is possible to imagine many other ways you might want to measure feature/term importances. Another one that we'll probably add at some point is the change in metrics like log loss when you remove each individual feature from the model. This will require adding an "X" parameter to the function since we'll then need access to a dataset. This would also add utility in terms of calculating importances on test sets, etc..

Averaging the feature/term importances across folds should work. If the folds have different sample weights, you might want to take the weighted average.

Thank you so much for the detailed reply. I looked at the code that does this calculation. If I understand correctly, "min_max" is the maximum contribution of each feature across all samples (on the training set) minus the minimum contribution (no absolute value is taken). So if a feature has a high contribution to a certain class for all the samples, then 'max - min' will be a small number; is my understanding correct? I found that the top k pairwise terms differ between folds/iterations, so I didn't include pairwise terms when calculating the average global feature/term importance (I haven't thought of a better way). I hadn't paid attention to how the pairwise terms are computed before, and I couldn't find it in the code (I guess it may be addition or multiplication after normalization). Can you recommend a paper that describes how the pairwise terms are calculated?

paulbkoch commented 4 months ago

Yes, your understanding is correct. "min_max" is a better indicator of extreme contributions from a small number of samples vs "avg_weight". Both of these metrics have their place, but I'd generally recommend using "avg_weight" unless you have a specific reason to use "min_max".

There are a lot of ways to handle the pairwise disagreement issues. With interactions there are sort of two ways to think about them in the context of having multiple models. One way to think about the problem is to say that if a pair is present in one model, but not the other, then it's essentially present in both models but has a contribution value of zero in the model where it is missing. Another way to think about the problem is to understand that the "interaction=10" parameter in the EBM constructor is merely a threshold and the pairwise interaction should really be in both models but just didn't make the cut to be included in the model where it is missing. We have the same issue in the implementation of the merge_ebms function where we have to decide how to merge two EBM models with pair disagreements. Right now merge_ebms assumes a contribution of zero, but I plan to add a blending option that will allow the caller to choose which assumption they want.
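
As a minimal sketch of the merge itself (ebm_fold1 and ebm_fold2 are hypothetical names for two EBMs trained on the same features):

from interpret.glassbox import merge_ebms

# merge the per-fold models into a single EBM; as noted above, a pair present
# in only one model is currently treated as contributing zero in the others
merged_ebm = merge_ebms([ebm_fold1, ebm_fold2])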

Internally in InterpretML we have another similar issue when constructing the outer_bags since the pairs can disagree there too. We handled it by first building the mains model, then we measure the pair strengths within each outer bag and come to a consensus set of pairs, then we continue boosting on the consensus set of pairs. It is possible today to replicate this process yourself, although it's a more advanced process that we hope to simplify in the future. Our documentation has an example of how you would do this:

https://interpret.ml/docs/python/examples/custom-interactions.html

I'm a little unclear on what you're asking regarding the pairwise term calculations. If you're asking how we choose which pairs to include in the model, the answer is that we continue to use the FAST algorithm from this paper: https://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf

If you're asking how we calculate the pair importances, the code that performs that calculation is here. It works for both mains and pairs:

https://github.com/interpretml/interpret/blob/45ee7a876bc5a05427af9109bed0cca9ea08ebf4/python/interpret-core/interpret/glassbox/_ebm/_ebm.py#L1819-L1833

And if you wanted to boil it down, this is the most critical line of that process. For interactions, mean_abs_score and self.binweights[i] are both tensors where the number of dimensions equals the number of features within the interaction:

https://github.com/interpretml/interpret/blob/45ee7a876bc5a05427af9109bed0cca9ea08ebf4/python/interpret-core/interpret/glassbox/_ebm/_ebm.py#L1826-L1828
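
Boiled down further, it's a weighted average of the absolute bin scores, with the per-bin training weights as the weights (a sketch for a binary classifier, using the public term_scores_ and bin_weights_ attributes):

import numpy as np

def term_importance_avg_weight(ebm, i):
    # absolute score of every bin of term i: a 1D tensor for mains, 2D for pairs
    abs_scores = np.abs(ebm.term_scores_[i])
    # weight each bin by the amount of training data that landed in it
    return np.average(abs_scores, weights=ebm.bin_weights_[i])

importances = [term_importance_avg_weight(ebm, i) for i in range(len(ebm.term_names_))]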

JWKKWJ123 commented 4 months ago

Yes, your understanding is correct. "min_max" is a better indicator of extreme contributions from a small number of samples vs "avg_weight". Both of these metrics have their place, but I'd generally recommend using "avg_weight" unless you have a specific reason to use "min_max".

There are a lot of ways to handle the pairwise disagreement issues. With interactions there are sort of two ways to think about them in the context of having multiple models. One way to think about the problem is to say that if a pair is present in one model, but not the other, then it's essentially present in both models but has a contribution value of zero in the model where it is missing. Another way to think about the problem is to understand that the "interaction=10" parameter in the EBM constructor is merely a threshold and the pairwise interaction should really be in both models but just didn't make the cut to be included in the model where it is missing. We have the same issue in the implementation of the merge_ebms function where we have to decide how to merge two EBM models with pair disagreements. Right now merge_ebms assumes a contribution of zero, but I plan to add a blending option that will allow the caller to choose which assumption they want.

Internally in InterpretML we have another similar issue when constructing the outer_bags since the pairs can disagree there too. We handled it by first building the mains model, then we measure the pair strengths within each outer bag and come to a consensus set of pairs, then we continue boosting on the consensus set of pairs. It is possible today to replicate this process yourself, although it's a more advanced process that we hope to simplify in the future. Our documentation has an example of how you would do this:

https://interpret.ml/docs/python/examples/custom-interactions.html

I'm a little unclear on what you're asking regarding the pairwise term calculations. If you're asking how we choose which pairs to include in the model, the answer is that we continue to use the FAST algorithm from this paper: https://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf

If you're asking how we calculate the pair importances, the code that performs that calculation is here. It works for both mains and pairs:

https://github.com/interpretml/interpret/blob/45ee7a876bc5a05427af9109bed0cca9ea08ebf4/python/interpret-core/interpret/glassbox/_ebm/_ebm.py#L1819-L1833

And if you wanted to boil it down, this is the most critical line of that process. For interactions, mean_abs_score and self.binweights[i] are both tensors where the number of dimensions equals the number of features within the interaction:

https://github.com/interpretml/interpret/blob/45ee7a876bc5a05427af9109bed0cca9ea08ebf4/python/interpret-core/interpret/glassbox/_ebm/_ebm.py#L1826-L1828

Thank you very much for the detailed reply! Now I am trying to average the importances across different folds/iterations, because some people are interested in the confidence intervals of each feature importance across folds/iterations. I have already implemented this: [image] I was also trying to use the merge_ebms function today, and I found I need to upgrade the interpret package to the newest version to use it. However, ebm.predict_and_contrib() no longer exists in the newest version. Is there a function that does the same thing as ebm.predict_and_contrib() (i.e. outputs the feature importance for a given group of samples)?

paulbkoch commented 4 months ago

The answer you seek is:

https://interpret.ml/docs/python/api/ExplainableBoostingClassifier.html#interpret.glassbox.ExplainableBoostingClassifier.eval_terms
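
In other words (a sketch, assuming a fitted binary classifier ebm and a feature matrix X):

import numpy as np

# (n_samples, n_terms) matrix of per-term contributions, as predict_and_contrib gave
contribs = ebm.eval_terms(X)

# group-level importance on this particular set of samples
group_importance = np.mean(np.abs(contribs), axis=0)

# for a binary classifier, intercept_ plus the summed contributions gives the logit
logits = ebm.intercept_ + contribs.sum(axis=1)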

JWKKWJ123 commented 4 months ago

The answer you seek is:

https://interpret.ml/docs/python/api/ExplainableBoostingClassifier.html#interpret.glassbox.ExplainableBoostingClassifier.eval_terms

Hi Paul, thank you very much! Now I have one last question: I trained the model without interactions when I calculated the average feature importance across folds/iterations. But if I train the model with interactions and only show the average feature importance of the original (main-effect) features, does the ordering of the features still make sense? I think the ordering may still make sense, but the mean absolute contribution would not be accurate in this case?

paulbkoch commented 4 months ago

Hi @JWKKWJ123 -- Can you post some code showing how you're doing these averages and making the folds? Doing these kinds of averages and comparing importance values and/or ordering them is really going to depend on the minutiae of how these calculations are made.

JWKKWJ123 commented 4 months ago

Hi @JWKKWJ123 -- Can you post some code showing how you're doing these averages and making the folds? Doing these kinds of averages and comparing importance values and/or ordering them is really going to depend on the minutiae of how these calculations are made.

Hi Paul, to keep the code concise, I only included the calculation of the average global importance and the confidence intervals; I use a very simple way to calculate them.

import time
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import scipy.stats as st
from sklearn.model_selection import StratifiedShuffleSplit
from interpret.glassbox import ExplainableBoostingClassifier

per_fold_importance = []
for i in range(n_iterations):
    # randomly split the dataset for each iteration of the binary classification task
    start = time.perf_counter()
    ss1 = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=i)  # new split each iteration
    train_index, test_index = next(ss1.split(X, Y))
    train_data_size = len(train_index)
    test_data_size = len(test_index)
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = Y[train_index], Y[test_index]
    # train EBM (I temporarily set the interactions to 0)
    ebm = ExplainableBoostingClassifier(interactions=0)
    ebm.fit(X_train, y_train)
    # mean absolute per-term contribution on the training set for this iteration
    local_contribution = ebm.eval_terms(X_train)
    per_fold_importance.append(np.average(np.abs(local_contribution), axis=0))

# terms x iterations matrix of per-iteration importances
global_contribution = np.stack(per_fold_importance, axis=1)

names = ebm.term_names_
CI_importance = []
for k in range(len(names)):
    # compute_confidence is my own function for the re-sampled t-test
    CI = compute_confidence(global_contribution[k], train_data_size, test_data_size)
    CI_importance.append((CI[1] - CI[0]) / 2)

importance_list = np.average(global_contribution, axis=1)

# rank the features
list_testset = []
for (term, importance, CI) in zip(names, importance_list, CI_importance):
    list_testset.append([term, importance, CI])

importance_df = pd.DataFrame(list_testset, columns=['Feature', 'Importance', 'Errorbar'])
importance_df = importance_df.sort_values('Importance', ascending=True)

# Plot the global feature importance
fig = go.Figure(go.Bar(x=importance_df["Importance"], y=importance_df["Feature"],
                       error_x_array=importance_df['Errorbar'], orientation='h',
                       name='Feature Importances of EBM', marker_color='blue'))

paulbkoch commented 3 months ago

Looks reasonable to me.