interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

What is the baseline and/or best way to group correlated features? #487

Closed hoangthienan95 closed 3 weeks ago

hoangthienan95 commented 6 months ago

First of all, thank you for the amazing package, and the dedication/effort to answer all the questions so quickly and thoroughly! The documentation and Github issues are a gold mine of knowledge.

This is related to #405, #179, and #232, where you have a dataset with a lot of features that are correlated to varying degrees. I appreciate that EBM will split the contributions of correlated features, so I can later choose which feature to keep. I want to group features into subsets so I can use compute_group_importance to compare importance within and between groups.
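For context, a minimal sketch of that call. The import path is an assumption, since the group-importance helpers live in a research submodule whose location has moved between interpret versions, and the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier
# NOTE: assumed import path; compute_group_importance lives in a research
# submodule whose location varies across interpret versions.
from interpret.glassbox._ebm._research import compute_group_importance

# Tiny synthetic stand-in for a real dataset (hypothetical column names).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(size=500),
    "education": rng.normal(size=500),
    "age": rng.normal(size=500),
})
y = (X["income"] + 0.5 * X["education"] + rng.normal(size=500) > 0).astype(int)

ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)

# Mean absolute contribution of a user-defined group of terms.
print(compute_group_importance(["income", "education"], ebm, X))
```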

However, I'm facing the dilemma of how to group the features optimally. Here are some things I have tried in the past that did not yield optimal results:

1) Calculate a matrix of mutual information (to account for non-linear correlation) between features, then cluster that (n_features, n_features) matrix (a sketch of this approach follows the list).

2) Cluster the local explanation matrix of all the features for the whole dataset, of shape (n_features, n_samples), or use UMAP to reduce the local explanation matrix to (n_features, 2) and then cluster that.
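A sketch of approach 1, using scikit-learn's MI estimator and scipy's hierarchical clustering; the distance transform and the fixed number of clusters are arbitrary choices of mine, not anything prescribed by interpret:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def mi_feature_groups(X, n_groups=5):
    """Group the columns of a 2-D array X by pairwise mutual information."""
    n = X.shape[1]
    mi = np.zeros((n, n))
    for j in range(n):
        # MI of every feature against feature j (captures non-linear dependence).
        mi[:, j] = mutual_info_regression(X, X[:, j], random_state=0)
    mi = (mi + mi.T) / 2.0             # symmetrize the estimates
    np.fill_diagonal(mi, 0.0)          # drop self-information
    mi /= max(mi.max(), 1e-12)         # scale to [0, 1]
    dist = 1.0 - mi                    # high MI -> small distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_groups, criterion="maxclust")  # labels 1..n_groups
```

The returned labels can then be mapped back to column names to build the group lists that compute_group_importance expects.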

What is the optimal way to do this? Or what is a quick/semi-automated baseline I can use to group features (to improve upon or to compare with my manual grouping of features)?

Thanks!

hoangthienan95 commented 5 months ago

Just wanted to check in on this to see if anyone has an answer. Let me know if I should add any information or simplify the question. Sorry, I'm new to interpret/EBM.

richcaruana commented 5 months ago

Good questions! Sometimes the way to group features is not related to the correlation among the features. For example, in healthcare we sometimes group features into those we can affect and those we can't change, because clinicians are mainly focused on what they can (or can't) do to help the patient. They can control body temperature, blood pressure, and creatinine levels through different kinds of interventions, but they can't change age, gender, or medical history. The reason to group features this way is to find patients for whom there is something the clinicians can do to reduce the patient's risk, as opposed to patients where all or most of the risk comes from things they can't change. Other groupings we commonly use in healthcare are all medications in one group, all lab test results in another group, everything due to the patient's prior health in a separate group, etc. In other words, often there are logical/meaningful groups of features that the end user wants to see the importance of.
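To make that concrete, a hedged sketch of the actionable-vs-fixed split (hypothetical clinical feature names, reusing the fitted ebm, X, and the assumed compute_group_importance import from the sketch above):

```python
# Hypothetical clinical feature names; the groups reflect what clinicians
# can intervene on, not statistical correlation between the features.
actionable = ["body_temperature", "blood_pressure", "creatinine"]
fixed = ["age", "gender", "prior_history"]

for name, group in [("actionable", actionable), ("fixed", fixed)]:
    print(name, compute_group_importance(group, ebm, X))
```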

In other settings you might want to do exactly what you propose: take all features that are variations of the same information (and thus highly correlated) and put them into a group so you can see the overall importance of that source of information. For example, if there are multiple features associated with race, income, education, socio-economic class, job, etc., you might want to put these in a group and see how important social determinants are to what you are trying to predict. Because these are likely to be correlated with each other, you could imagine a clustering algorithm that would try to automatically cluster/group them together. But how you would want to do this depends on what you are trying to accomplish. I don't think there is an "optimal" grouping of features unless you can define what optimality is with respect to.

One thing that is very important to keep in mind is that grouping features has no effect on how the model is trained, what the term graphs look like, or what predictions the model makes. Groups are only a tool for organizing features so that you can see an estimate of the importance of different groups of features, as opposed to only seeing the importance of features measured individually. So grouping is just a tool to help you better understand the relative importance of different sources of information in the model, including sources that are now groups of features. You can have the same feature show up in multiple groups and everything still works fine. One group we sometimes create is the group of all features! This shows us the average "importance" of the entire group of features, and gives us a kind of maximum baseline to help normalize the importance of different subsets of features, or of features taken independently.
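A sketch of that all-features baseline, assuming a fitted EBM exposing the term_names_ attribute (present in recent interpret releases) and the same assumed compute_group_importance helper as above:

```python
# Group containing every term in the model (main effects and interactions).
baseline = compute_group_importance(list(ebm.term_names_), ebm, X)

# Any other group's importance can then be read relative to this maximum.
social = compute_group_importance(["income", "education"], ebm, X)
print(f"social group carries {social / baseline:.1%} of the all-terms baseline")
```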