Open finnhacks42 opened 3 years ago
I recall discussing this when developing permutation_importance and deciding that there was no use case for selecting a subset. Thank you for providing some use cases!
We want to compute feature importance for different features at different stages of the transformation pipeline.
Can you expand on this use case a little?
Sure! An example of computing feature importance at different stages of the pipeline:
OneHotEncoder. For these variables we want to do the permutation before the encoding to get an indication of the overall role each categorical variable plays at a high level.
ColumnTransformer (or DataFrameMapper) and the whole thing is occurring inside a cross-val loop.
The existing permutation_importance function can already be applied in this setting, e.g.:
pipeline = make_pipeline(ColumnTransformer(...), Regressor(...)).fit(X_train, y_train)
importance_pre = permutation_importance(pipeline, X_test, y_test)
importance_post = permutation_importance(pipeline[1:], pipeline[0].transform(X_test), y_test)
The downside is that I will end up computing a lot of importances I don't care about, e.g. importances for individual survey questions in importance_pre and importances for one-hot encoded features in importance_post.
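For reference, here is a self-contained sketch of the two calls above. The toy data, the column names and the choice of RandomForestRegressor are assumptions standing in for the elided ColumnTransformer(...) and Regressor(...); only the pre/post pattern comes from the snippet.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): one categorical and one numeric column.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=n),
    "weight": rng.normal(size=n),
})
y = (X["color"] == "red").astype(float) * 2.0 + X["weight"] + rng.normal(scale=0.1, size=n)
X_train, X_test = X.iloc[:150], X.iloc[150:]
y_train, y_test = y.iloc[:150], y.iloc[150:]

# sparse_threshold=0 forces a dense array out of the ColumnTransformer.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
    sparse_threshold=0,
)
pipeline = make_pipeline(
    preprocess, RandomForestRegressor(n_estimators=50, random_state=0)
).fit(X_train, y_train)

# Permute the raw columns: one importance per original feature.
importance_pre = permutation_importance(
    pipeline, X_test, y_test, n_repeats=5, random_state=0
)

# Permute the transformed columns: one importance per encoded feature.
X_test_encoded = pipeline[0].transform(X_test)
importance_post = permutation_importance(
    pipeline[1:], X_test_encoded, y_test, n_repeats=5, random_state=0
)

print(importance_pre.importances_mean.shape)   # one entry per raw column
print(importance_post.importances_mean.shape)  # one entry per encoded column
```

Both calls permute every column they are given, which is the cost being discussed here.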
So it would require passing the features by name or index, and the output would contain np.nan/None where the permutation was not computed.
I was imagining you would pass features by index (it is easy enough to track names and convert to indices outside of the function if required) and then you would get back results with the same dimensionality as the features you passed in. I guess you could always return an output with results for all columns, with np.nan for those not computed. That just seems a little less intuitive to me and would also require a larger change to the code.
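To make the two output shapes concrete, here is a minimal sketch of converting the compact result into the NaN-padded one. The helper name expand_to_all_features and the importance values are made up for illustration; nothing here is part of scikit-learn.

```python
import numpy as np

def expand_to_all_features(importances, features, n_features):
    """Hypothetical helper: scatter importances for `features` into a
    length-n_features array, leaving NaN where nothing was computed."""
    full = np.full(n_features, np.nan)
    full[np.asarray(features)] = importances
    return full

# Illustrative values only, not real results.
features = [0, 1, 4, 5]
compact = np.array([0.30, 0.25, 0.10, 0.05])  # same length as `features`
full = expand_to_all_features(compact, features, n_features=6)
print(full)
```

Going from compact to padded is a one-liner for the caller, which is one argument for returning the compact form by default.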
Stumbling across this as well. With the development of ColumnTransformer, having the option to use a subset of the columns would be inherently important.
The API could be as with partial_dependence, i.e., passing a list of column names or column indices. The output would follow the order of these.
We could, e.g., pass:
features = [0, 1, 4, 5]
features = ["latitude", "longitude", "land_size", "living_area"]
and later:
features = [("latitude", "longitude"), "land_size", "living_area"]
The last example would permute lat/lon together, mending an often heard critique that permutation importance fails for dependent features.
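A rough sketch of what such a features argument could do, written with NumPy only. The function permutation_importance_subset, its signature and the toy model are hypothetical and not an existing scikit-learn API; column indices are used instead of names to keep it short. A tuple entry is permuted jointly with a single shared row permutation.

```python
import numpy as np

def permutation_importance_subset(predict, X, y, features, score,
                                  n_repeats=5, random_state=None):
    """Hypothetical helper, not part of scikit-learn.

    `features` entries are either a column index or a tuple of indices
    that are permuted together. Returns score drops with shape
    (len(features), n_repeats), in the order of `features`.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    baseline = score(y, predict(X))
    drops = np.empty((len(features), n_repeats))
    for i, group in enumerate(features):
        cols = list(group) if isinstance(group, tuple) else [group]
        for r in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(X.shape[0])
            # One shared permutation keeps the grouped columns aligned.
            X_perm[:, cols] = X[perm][:, cols]
            drops[i, r] = baseline - score(y, predict(X_perm))
    return drops

def r2(y_true, y_pred):
    """Plain R^2, defined here to avoid extra dependencies."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy setup (assumption): the target depends on columns 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + X[:, 1]
predict = lambda data: data[:, 0] + data[:, 1]  # stand-in for a fitted model

drops = permutation_importance_subset(
    predict, X, y, features=[(0, 1), 2], score=r2, n_repeats=5, random_state=0
)
print(drops.mean(axis=1))  # first entry: columns 0 and 1 jointly; second: column 2
```

Only the requested columns or groups are ever permuted, so the cost scales with len(features) rather than with the number of columns in X.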
Gently pinging @glemaitre @thomasjpfan
This is how it looks in R:
library(ranger)
library(hstats)
library(shapviz)
n_train <- 1e4
train <- miami[1:n_train, ]
test <- miami[(n_train + 1):nrow(miami), ]
fit <- ranger(
SALE_PRC ~ LATITUDE + LONGITUDE + LND_SQFOOT + TOT_LVG_AREA, data = train
)
xvars <- c("LATITUDE", "LONGITUDE", "LND_SQFOOT", "TOT_LVG_AREA")
perm_importance(fit, v = xvars, X = test, y = test$SALE_PRC, normalize = TRUE) |>
plot()
xgroups <- list(coords = c("LATITUDE", "LONGITUDE"), size = c("LND_SQFOOT", "TOT_LVG_AREA"))
perm_importance(fit, v = xgroups, X = test, y = test$SALE_PRC, normalize = TRUE) |>
plot()
Thanks @mayer79 for bringing this discussion back. I'll add this issue to the list of priorities when it comes to inspection.
Describe the workflow you want to enable
We have datasets with very large numbers of features, where we only care about feature importance for a subset of them. This can happen when:
ColumnTransformer) and we know in advance that other columns can have no impact on the final model.
Currently permutation_importance always computes the importance for all features. This is slow and inefficient if you only want results for a small proportion of the features.
Describe your proposed solution
Add an optional list of indices specifying the features for which to compute importance, e.g.
permutation_importance(..., features=None)
Then adapt the function to iterate only through the specified indices if supplied, or through all columns by default (as happens currently).