Implement automatic feature selection methods

markdregan commented 1 year ago

This PR addresses #219

Work tasks:

[x] Implement automatic feature selection method _get_best_num_features()
- [x] 1. best: strictly selects the num_features with the highest model score.
- [x] 2. best_coherent: For iterations that are within standard_error_threshold of the highest score, select the iteration with the lowest standard deviation of model score.
- [x] 3. best_parsimonious: For iterations that are within standard_error_threshold of the highest score, select the iteration with the fewest features.
[x] Implement ability to get support (boolean mask of features selected)
[x] Implement ability to get feature ranking
[x] Update doc strings throughout
[x] Update documentation
[x] Add unit tests

markdregan commented 1 year ago

I have a strawman in place for this PR. I have one outstanding question on how ranking_ should be computed.

My assumptions:

For each RFECV iteration, we store list of feature names in self.report_df["features_set"] in ranked order (most to least importance)
The list of eliminated_features is also stored in ranked order??

Proposal:

To get a ranking of feature importances for a given RFECV iteration, we can rely on concatenating features stored in features_set + eliminated_features (for the best/selected iteration) + eliminated_features (for the previous RFECV iteration) + ... . (request feedback on this)

Edge cases:

Shap value ties. Because we don't have access to shap values directly in report_df, we can't assign same ranking to features that have the same shap importance.
min_features_to_select: The last iteration of the RFECV can have > 1 features. Should I assume these are ranked in shap importance too? (avoiding ties and same rank features)

ReinierKoops commented 1 year ago

eliminated_features is ordered in order of importance [most, ..., least]. However, features_set is not in order of importance. The trade-off here was we want to stay as close to the insertion order of how the columns were provided by the user. But, since its not the case for eliminated_features, I understand this might be a bit confusing.

Shap value ties: While it is unlikely, when a tie occurs, it would be nice to have some condition that we could take into account when ranking. I'd say by default leave it as it is. However, I'd imagine you'd might want to rank another feature higher when their variance is lower if you were to use that measure (related to your other pr).
min_features_to_select: So this points back to the previous difference. the features_set is in the order of how the user provided the columns.

The origin of this is that (some) algorithms are sensitive to the order of the columns/data provided (for example when it has to choose which column to drop, and let's say its a tie, it'll keep the first, and drop the latter). Next to that, we believe that users provide the columns in the order that they think is the best. In most cases the user might not be aware of that or chooses that any order to insert into the algorithm is fine, however for the users that provide a specific order, we ensure with this implementation that it is taken into account when doing SHAPRFECV.

Does this answer your questions? If not please let me know :)

markdregan commented 1 year ago

Very helpful feedback. TY.

I'm now questioning if it is possible to extract feature_ranking from self.report_df alone. I think it is possible. Only eliminated_features is ordered by shap importance. So, the concatenation of eliminated_features is actually a ranked list of all features by importance. This should be an ok V1 which can be improved upon (eg. with ties etc).

I'll prototype this and see if it makes sense. Thanks again.

ReinierKoops commented 1 year ago

Also, in the method calculate_shap_importance we generate the shap_importance_df that is ranked in order of [most, ... least]. Then we use _get_current_features_to_remove to remove the last n of it.

markdregan commented 1 year ago

Thanks for feedback @ReinierKoops - I believe I've addressed all the feedback.

I think a separate PR might investigate improving the ranking logic to incorporate calculate_shap_importance. Some plumbing of these scores/ranking through to report_df would be needed.

markdregan commented 1 year ago

Closed by accident.

ing-bank / probatus

Implement automatic feature selection methods #220