scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add option to inspection.permutation_importance to only compute importance for a subset of features #18694

Open finnhacks42 opened 3 years ago

finnhacks42 commented 3 years ago

Describe the workflow you want to enable

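We have datasets with very large numbers of features, where we only care about feature importance for a subset of them. This can happen when, for example, we want to compute feature importance for different features at different stages of the transformation pipeline.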

Currently, permutation_importance always computes the importance for all features. This is slow and inefficient if you only want results for a small proportion of the features.

Describe your proposed solution

Add an optional list of indices specifying the features for which to compute importance, e.g. permutation_importance(..., features=None).

Then adapt the function to iterate only through the specified indices if supplied, or through all columns by default (as happens currently).

# Iterate over the requested columns only; default to every column.
column_indexes = range(X.shape[1]) if features is None else features
scores = Parallel(n_jobs=n_jobs)(delayed(_calculate_permutation_scores)(
        estimator, X, y, col_idx, random_seed, n_repeats, scorer
) for col_idx in column_indexes)

thomasjpfan commented 3 years ago

I recall discussing this when developing permutation_importance and decided that there was not a use case for selecting a subset. Thank you for providing some use cases!

We want to compute feature importance for different features at different stages of the transformation pipeline.

Can you expand on this use case a little?

finnhacks42 commented 3 years ago

Sure! An example of computing feature importance at different stages of the pipeline:

The existing permutation_importance function can already be applied in this setting, e.g.:

pipeline = make_pipeline(ColumnTransformer(...), Regressor(...)).fit(X_train, y_train)
importance_pre = permutation_importance(pipeline, X_test, y_test)
importance_post = permutation_importance(pipeline[1:], pipeline[0].transform(X_test), y_test)

The downside is that I will end up computing a lot of importances I don't care about, e.g. importances for individual survey questions in importance_pre and importances for one-hot encoded features in importance_post.

glemaitre commented 3 years ago

So it would require passing the features by name or index, and the output would contain np.nan/None where the permutation was not computed.

finnhacks42 commented 3 years ago

I was imagining you would pass features by index (it is easy enough to track names and convert them to indices outside of the function) and get back results with the same dimensionality as the features you passed in. I guess you could instead always return an output covering all columns, with np.nan for those not computed. That just seems a little less intuitive to me and would also require a larger change to the code.
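To make the two output conventions discussed above concrete, here is a minimal sketch using only NumPy. The names permutation_importance_subset, pad_to_all_columns and LinearStub are hypothetical and not part of scikit-learn; the sketch assumes X is a 2-D NumPy array and that the estimator exposes a fitted score(X, y) method, as scikit-learn regressors and classifiers do. It is not the library's implementation (no parallelism, no custom scorers).

```python
import numpy as np


def permutation_importance_subset(estimator, X, y, features=None,
                                  n_repeats=5, random_state=None):
    """Score drop per permuted column, for the requested columns only.

    Hypothetical helper (not a scikit-learn API). Returns an array of shape
    (n_selected_features, n_repeats), ordered like `features`.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    column_indexes = range(X.shape[1]) if features is None else list(features)
    baseline = estimator.score(X, y)

    importances = np.empty((len(column_indexes), n_repeats))
    for row, col_idx in enumerate(column_indexes):
        X_permuted = X.copy()
        for rep in range(n_repeats):
            # Shuffle a single column; every other column stays untouched.
            X_permuted[:, col_idx] = rng.permutation(X[:, col_idx])
            importances[row, rep] = baseline - estimator.score(X_permuted, y)
    return importances


def pad_to_all_columns(importances, features, n_features):
    """Alternative convention: one row per column, NaN where not computed."""
    full = np.full((n_features, importances.shape[1]), np.nan)
    full[list(features)] = importances
    return full


class LinearStub:
    """Stand-in for a fitted regressor: predicts X @ coef, scores with R^2."""

    def __init__(self, coef):
        self.coef = np.asarray(coef, dtype=float)

    def score(self, X, y):
        pred = X @ self.coef
        return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 2]
model = LinearStub([3.0, 0.0, 0.5, 0.0])

# Compact output: one row per requested feature.
result = permutation_importance_subset(model, X, y, features=[0, 1],
                                       n_repeats=3, random_state=0)
# Padded output: one row per column of X, NaN for columns 2 and 3.
padded = pad_to_all_columns(result, [0, 1], X.shape[1])
```

The compact form keeps the change small and the cost proportional to the number of requested features; the padded form is only a post-processing step on top of it, so either convention could be offered without computing anything extra.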

mayer79 commented 4 weeks ago

I stumbled upon this as well. With the development of ColumnTransformer, having the option to use a subset of the columns would be inherently important.

The API could be as with partial_dependence, i.e., passing a list of column names or column indices. The output would follow the order of these.

We could, e.g., pass:

The last example would permute lat/lon together, mending an often heard critique that permutation importance fails for dependent features.

Gently pinging @glemaitre @thomasjpfan

This is how it looks in R:

library(ranger)
library(hstats)
library(shapviz)

n_train <- 1e4
train <- miami[1:n_train, ]
test <- miami[(n_train + 1):nrow(miami), ]

fit <- ranger(
  SALE_PRC ~ LATITUDE + LONGITUDE + LND_SQFOOT + TOT_LVG_AREA, data = train
)

xvars <- c("LATITUDE", "LONGITUDE", "LND_SQFOOT", "TOT_LVG_AREA")
perm_importance(fit, v = xvars, X = test, y = test$SALE_PRC, normalize = TRUE) |> 
  plot()

xgroups <- list(coords = c("LATITUDE", "LONGITUDE"), size = c("LND_SQFOOT", "TOT_LVG_AREA"))
perm_importance(fit, v = xgroups, X = test, y = test$SALE_PRC, normalize = TRUE) |> 
  plot()
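A rough Python counterpart of the grouped call above could look like the following sketch. The function name grouped_permutation_importance, the groups argument and the ProductStub class are hypothetical, not an existing scikit-learn API; the sketch assumes X is a 2-D NumPy array and that the estimator has a fitted score(X, y) method. The key point is that all columns of a group share one row permutation, so pairs such as latitude/longitude stay together.

```python
import numpy as np


def grouped_permutation_importance(estimator, X, y, groups,
                                   n_repeats=5, random_state=None):
    """Mean score drop per named group of columns, permuted as a unit.

    Hypothetical helper (not a scikit-learn API). `groups` maps a label to
    a list of column indices.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    baseline = estimator.score(X, y)
    result = {}
    for name, cols in groups.items():
        cols = list(cols)
        drops = np.empty(n_repeats)
        for rep in range(n_repeats):
            # One row order shared by every column in the group, so the
            # relationship between the group's columns is preserved.
            order = rng.permutation(X.shape[0])
            X_permuted = X.copy()
            X_permuted[:, cols] = X[order][:, cols]
            drops[rep] = baseline - estimator.score(X_permuted, y)
        result[name] = drops.mean()
    return result


class ProductStub:
    """Stand-in for a fitted model with an interaction between columns 0 and 1."""

    def score(self, X, y):
        pred = X[:, 0] * X[:, 1] + X[:, 2]
        return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(300, 4))
y = X[:, 0] * X[:, 1] + X[:, 2]

groups = {"coords": [0, 1], "size": [2], "unused": [3]}
grouped = grouped_permutation_importance(ProductStub(), X, y, groups,
                                         n_repeats=3, random_state=0)
```

Passing single-column groups reduces this to ordinary per-feature permutation importance, so a list of names/indices and a mapping of groups could plausibly share one code path.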

glemaitre commented 2 weeks ago

Thanks @mayer79 for bringing this discussion back. I'll add this issue to the list of priorities when it comes to inspection.