ing-bank / probatus

Validation (like Recursive Feature Elimination for SHAP) of (multiclass) classifiers & regressors and data used to develop them.
https://ing-bank.github.io/probatus
MIT License

Implement automatic feature selection methods #220

Closed · markdregan closed this 1 year ago

markdregan commented 1 year ago

This PR addresses #219

Work tasks:

markdregan commented 1 year ago

I have a strawman in place for this PR. I have one outstanding question on how ranking_ should be computed.

My assumptions:

Proposal:

Edge cases:

ReinierKoops commented 1 year ago

eliminated_features is ordered by importance [most, ..., least]. However, features_set is not ordered by importance. The trade-off here is that we want to stay as close as possible to the insertion order in which the user provided the columns. But since that is not the case for eliminated_features, I understand this might be a bit confusing.

The origin of this is that some algorithms are sensitive to the order of the columns/data provided (for example, when choosing which column to drop and it's a tie, the first one is kept and the later one is dropped). Beyond that, we believe users provide the columns in the order they think is best. In most cases the user may not be aware of this, or considers any insertion order fine; however, for users who do provide a specific order, this implementation ensures it is taken into account when running ShapRFECV.
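To make the two orderings concrete, a made-up example (not actual output):

```python
# Columns exactly as the user passed them in; features_set preserves this
# insertion order throughout the elimination rounds.
features_set = ["f3", "f1", "f2", "f4"]

# Features dropped in one round, ordered by SHAP importance among the
# dropped ones (most -> least important), independent of insertion order.
eliminated_features = ["f1", "f4"]
```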

Does this answer your questions? If not please let me know :)

markdregan commented 1 year ago

Very helpful feedback. TY.

I'm now wondering whether it is possible to extract feature_ranking from self.report_df alone. I think it is. Only eliminated_features is ordered by SHAP importance, so concatenating eliminated_features across rounds actually yields a ranked list of all features by importance. This should be an acceptable V1 that can be improved later (e.g. to handle ties).
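A minimal sketch of what I have in mind (assuming report_df stores list-valued features_set and eliminated_features columns per round; tie handling is deliberately left out):

```python
import pandas as pd


def ranking_from_report(report_df: pd.DataFrame) -> list:
    """Derive a most-to-least important feature ranking from report_df alone."""
    eliminated = []
    # Features dropped in later rounds are more important than those dropped
    # earlier, so walk the rounds from last to first. Within a round,
    # eliminated_features is already ordered most -> least important.
    for dropped in reversed(list(report_df["eliminated_features"])):
        eliminated.extend(dropped)
    # Features that were never dropped rank above everything that was.
    all_features = list(report_df["features_set"].iloc[0])
    survivors = [f for f in all_features if f not in eliminated]
    return survivors + eliminated
```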

I'll prototype this and see if it makes sense. Thanks again.

ReinierKoops commented 1 year ago

Also, in the method calculate_shap_importance we generate shap_importance_df, which is ranked [most, ..., least]. Then we use _get_current_features_to_remove to drop the last n features from it.
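In simplified form (not the exact implementation, but roughly the idea: rank by mean absolute SHAP value, then take the bottom n):

```python
import numpy as np
import pandas as pd


def shap_importance(shap_values: np.ndarray, columns: list) -> pd.DataFrame:
    """Rank features by mean absolute SHAP value, most important first."""
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
    return (
        pd.DataFrame({"feature": columns, "mean_abs_shap": mean_abs_shap})
        .sort_values("mean_abs_shap", ascending=False)
        .reset_index(drop=True)
    )


def bottom_n_features(shap_importance_df: pd.DataFrame, n: int) -> list:
    """The last n rows of the ranked frame, i.e. the least important features."""
    return shap_importance_df["feature"].tail(n).tolist()
```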

markdregan commented 1 year ago

Thanks for the feedback @ReinierKoops - I believe I've addressed it all.

I think a separate PR might investigate improving the ranking logic to incorporate calculate_shap_importance. Some plumbing of these scores/ranking through to report_df would be needed.
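Roughly the kind of plumbing I mean (the eliminated_importance column is hypothetical, not part of the current report_df):

```python
# Hypothetical report_df row: alongside eliminated_features, also store the
# mean |SHAP| values of the dropped features for that round, so a later
# ranking step can break ties on the actual scores.
report_row = {
    "num_features": 8,
    "features_set": ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"],
    "eliminated_features": ["f7", "f2"],
    "eliminated_importance": [0.012, 0.004],  # hypothetical new column
}
```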

markdregan commented 1 year ago

Closed by accident.