Add option to inspection.permutation_importance to only compute importance for a subset of features

finnhacks42 commented 3 years ago

Describe the workflow you want to enable

We have datasets with very large numbers of features, where we only care about feature importance for a subset of them. This can happen when:

The model pipeline includes a transform in which only a subset of features are selected to pass on to the final estimator (eg via ColumnTransformer) and we know in advance that other columns can have no impact on the final model.
We want to compute feature importance for different features at different stages of the transformation pipeline.

Currently permutation_importance always computes the importance for all features. This is slow and inefficient if you only want results for a small proportion of the features.

Describe your proposed solution

Add an optional argument listing the indices of the features for which to compute importance, e.g. permutation_importance(..., features=None)

Then adapt the function to iterate only through the specified indices if supplied, or through all columns by default (as happens currently).

column_indexes = range(X.shape[1]) if features is None else features 
scores = Parallel(n_jobs=n_jobs)(delayed(_calculate_permutation_scores)(
        estimator, X, y, col_idx, random_seed, n_repeats, scorer
) for col_idx in column_indexes)
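
For readers who want to try the idea outside scikit-learn, here is a minimal standalone sketch. It is not scikit-learn's implementation: the helper name permutation_importance_subset and its features argument are hypothetical, and it assumes a fitted estimator with a predict method, a NumPy array X, and a scorer(y_true, y_pred) callable where higher is better.

```python
import numpy as np


def permutation_importance_subset(estimator, X, y, scorer, features=None,
                                  n_repeats=5, random_state=0):
    """Hypothetical helper: permutation importance for selected columns only.

    Returns an array of shape (len(features), n_repeats) holding the drop in
    score caused by shuffling each selected column.
    """
    X = np.asarray(X)
    rng = np.random.default_rng(random_state)
    # Default to every column, which mirrors the current behaviour.
    column_indexes = range(X.shape[1]) if features is None else list(features)
    baseline = scorer(y, estimator.predict(X))

    importances = np.empty((len(column_indexes), n_repeats))
    for row, col_idx in enumerate(column_indexes):
        X_permuted = X.copy()
        for rep in range(n_repeats):
            # Shuffle a single column; all other columns stay intact.
            X_permuted[:, col_idx] = rng.permutation(X[:, col_idx])
            importances[row, rep] = baseline - scorer(y, estimator.predict(X_permuted))
    return importances
```

The cost scales with the number of requested columns rather than with X.shape[1], which is the point of the proposal.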

thomasjpfan commented 3 years ago

I recall discussing this when developing permutation_importance and decided that there was not a use case for selecting a subset. Thank you for providing some use cases!

We want to compute feature importance for different features at different stages of the transformation pipeline.

Can you expand on this use case a little?

finnhacks42 commented 3 years ago

Sure! An example of computing feature importance at different stages of the pipeline:

We have some categorical variables that are put through a OneHotEncoder. For these variables we want to do the permutation before the encoding to get an indication of the overall role each categorical variable plays at a high level.
We have a very large set of variables that are responses to different sets of survey questions. For these variables we have a custom transformer that essentially computes a weighted sum of the survey responses within each set of questions. So for example, there might be five questions related to depression which are combined to form a single "depression" variable. For these variables we want to compute feature importance at the level of the aggregate, post-transform variables, rather than the individual question variables.
These transforms are applied to the relevant set of columns via a ColumnTransformer (or DataFrameMapper) and the whole thing is occurring inside a cross-val loop.

The existing permutation_importance function can already be applied in this setting, eg

pipeline = make_pipeline(ColumnTransformer(...), Regressor(...)).fit(X_train,y_train)
importance_pre = permutation_importance(pipeline,X_test,y_test)
importance_post = permutation_importance(pipeline[1:],pipeline[0].transform(X_test),y_test)

The downside is that I will end up computing a lot of importances I don't care about - eg importances for individual survey questions in importance_pre and importances for one-hot encoded features in importance_post

glemaitre commented 3 years ago

So it would require passing the features by name or index, and the output would contain np.nan/None wherever the permutation was not computed.

finnhacks42 commented 3 years ago

I was imagining you would pass features by index (it is easy enough to track names & convert to indices if required outside of the function) and then you would get back results with the same dimensionality as the features you passed in. I guess you could always return an output covering all columns, with np.nan for those not computed. That just seems a little less intuitive to me and would also require a larger change to the code.
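
The two output conventions are straightforward to convert between. As a rough sketch (the helper name expand_to_all_columns is hypothetical and not part of scikit-learn), compact per-feature results can be scattered into a full-width array that holds np.nan for every column that was not permuted:

```python
import numpy as np


def expand_to_all_columns(importances_subset, features, n_features):
    """Scatter compact per-feature results into a full-width, NaN-padded array."""
    importances_subset = np.asarray(importances_subset, dtype=float)
    # Keep any trailing axes (e.g. n_repeats) and pad only along the feature axis.
    full = np.full((n_features,) + importances_subset.shape[1:], np.nan)
    full[list(features)] = importances_subset
    return full


# Two computed importances for columns 0 and 4 of a six-column dataset.
padded = expand_to_all_columns([0.30, 0.05], features=[0, 4], n_features=6)
```

Going the other way is plain indexing, padded[[0, 4]], so returning the compact form loses nothing.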

mayer79 commented 4 weeks ago

Stumbling upon this as well. With the development of ColumnTransformer, having the option to use a subset of the columns would be inherently important.

The API could be as with partial_dependence, i.e., passing a list of column names or column indices. The output would follow the order of these.

We could, e.g., pass:

features = [0, 1, 4, 5]
features = ["latitude", "longitude", "land_size", "living_area"], and later
features = [("latitude", "longitude"), "land_size", "living_area"]

The last example would permute lat/lon together, mending an often heard critique that permutation importance fails for dependent features.
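
A minimal sketch of how such a features specification could be handled, assuming NumPy input. Both helper names (resolve_groups, permute_group) are hypothetical and only illustrate the idea: names are mapped to column indices, and every column in a group is shuffled with the same row order so that pairs such as (latitude, longitude) stay together.

```python
import numpy as np


def resolve_groups(features, feature_names):
    """Turn names, indices, or tuples of them into lists of column indices."""
    name_to_idx = {name: i for i, name in enumerate(feature_names)}

    def to_index(f):
        return name_to_idx[f] if isinstance(f, str) else int(f)

    groups = []
    for spec in features:
        # A bare name or index is treated as a group of size one.
        members = spec if isinstance(spec, (tuple, list)) else (spec,)
        groups.append([to_index(f) for f in members])
    return groups


def permute_group(X, group, rng):
    """Return a copy of X whose `group` columns share one shuffled row order."""
    X_permuted = X.copy()
    row_order = rng.permutation(X.shape[0])
    # One shared row order moves the grouped values jointly between rows.
    X_permuted[:, group] = X[np.ix_(row_order, group)]
    return X_permuted
```

Scoring the estimator on permute_group(X, group, rng) and comparing with the unpermuted score then gives one importance value per group rather than per column.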

Gently pinging @glemaitre @thomasjpfan

This is how it looks in R:

library(ranger)
library(hstats)
library(shapviz)

n_train <- 1e4
train <- miami[1:n_train, ]
test <- miami[(n_train + 1):nrow(miami), ]

fit <- ranger(
  SALE_PRC ~ LATITUDE + LONGITUDE + LND_SQFOOT + TOT_LVG_AREA, data = train
)

xvars <- c("LATITUDE", "LONGITUDE", "LND_SQFOOT", "TOT_LVG_AREA")
perm_importance(fit, v = xvars, X = test, y = test$SALE_PRC, normalize = TRUE) |> 
  plot()

xgroups <- list(coords = c("LATITUDE", "LONGITUDE"), size = c("LND_SQFOOT", "TOT_LVG_AREA"))
perm_importance(fit, v = xgroups, X = test, y = test$SALE_PRC, normalize = TRUE) |> 
  plot()

glemaitre commented 2 weeks ago

Thanks @mayer79 for bringing this discussion back. I'll add this issue to the list of priorities when it comes to inspection.

scikit-learn / scikit-learn

Add option to inspection.permutation_importance to only compute importance for a subset of features #18694
