koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] Feature Importance of ZeroInflatedRegressor #583

Closed jcoding2022 closed 1 year ago

jcoding2022 commented 1 year ago

In the ZeroInflatedRegressor, there is a classifier and a regressor. Can we extract the feature importance from the trained classifier and from the trained regressor? How about the overall feature importance from the ZeroInflatedRegressor, taking into account the importance of a feature both in the classifier and in the regressor?

koaning commented 1 year ago

You should be able to access the trained classifier and regressor separately. If this does not suffice, could you elaborate?

jcoding2022 commented 1 year ago

Sure. I am interested in the following two use cases:

  1. After training a zero-inflated regressor, how do we extract the respective feature importances from the trained classifier and regressor? Are there any methods to call? My workaround is to retrain the classifier and the regressor separately and read the feature importances from each. It works, but it is tedious. I was wondering if you have a better approach.
  2. Even if we can get the importance of a feature in the classifier and in the regressor, how can we get its importance in the overall zero-inflated regressor? For example, if feature 1 is the 2nd most important feature in the classifier and the 10th most important in the regressor, what is its importance rank in the overall zero-inflated regressor?

FBruzzesi commented 1 year ago

Hi @jcoding2022, let me expand on what Vincent mentioned.

For 1., you can directly access the trained classifier and regressor passed to the ZeroInflatedRegressor model, and therefore their feature importances, if available. Here is how:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklego.meta import ZeroInflatedRegressor

np.random.seed(0)

X = np.random.randn(10000, 4)
y = ((X[:, 0]>0) & (X[:, 1]>0)) * np.abs(X[:, 2] * X[:, 3]**2)

model = ZeroInflatedRegressor(
    classifier=RandomForestClassifier(random_state=0),
    regressor=RandomForestRegressor(random_state=0)
)
_ = model.fit(X, y)

model.classifier_.feature_importances_  # array([0.5203769 , 0.4763605 , 0.00154362, 0.00171898])
model.regressor_.feature_importances_  # array([0.01817036, 0.01665635, 0.32113857, 0.64403472])

However, this is only possible if the classifier and the regressor have the feature_importances_ attribute, and this is not necessarily the case. For example:

from sklearn.linear_model import LogisticRegression

# LogisticRegression exposes coef_ but not feature_importances_
model = ZeroInflatedRegressor(
    classifier=LogisticRegression(random_state=0),
    regressor=RandomForestRegressor(random_state=0)
)
_ = model.fit(X, y)

model.classifier_.feature_importances_  # AttributeError: 'LogisticRegression' object has no attribute 'feature_importances_'

This last example leads to point 2.
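
For estimators that lack feature_importances_, such as the LogisticRegression above, one possible fallback is to read the absolute coefficients instead. The sketch below is only an illustration: get_importances is a hypothetical helper (not part of scikit-lego or scikit-learn), and treating |coef_| as an importance assumes the features are on comparable scales. To keep it self-contained, the two estimators are fitted directly rather than through ZeroInflatedRegressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression


def get_importances(estimator):
    """Hypothetical helper: one non-negative score per feature, or None."""
    if hasattr(estimator, "feature_importances_"):
        return np.asarray(estimator.feature_importances_)
    if hasattr(estimator, "coef_"):
        # Assumption: features share a comparable scale, so |coef| is a rough proxy.
        coef = np.abs(np.atleast_2d(estimator.coef_))
        return coef.mean(axis=0)
    # Neither attribute is available: no importance can be read off directly.
    return None


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)

# Fit the two stages directly: classifier on zero vs. nonzero, regressor on nonzero rows.
clf = LogisticRegression().fit(X, y != 0)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X[y != 0], y[y != 0])

clf_importances = get_importances(clf)  # falls back to |coef_|
reg_importances = get_importances(reg)  # uses feature_importances_
```

With a fitted ZeroInflatedRegressor, the same helper would be called as get_importances(model.classifier_) and get_importances(model.regressor_). Note that the two outputs are not on the same scale, so they should not be compared with each other directly.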

jcoding2022 commented 1 year ago

* What should be the feature importance of the overall model?

Let me try to answer this question. A general use case of the zero-inflated regressor is modeling data with a zero-inflated, continuous target. We train a classifier to predict whether the target is zero or not, and a regressor to predict the value of the target given that the classifier predicts it is nonzero. A feature can therefore affect the classifier, the regressor, or both.
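
To make the two stages concrete, here is a minimal sketch of the logic described above. It is a simplified illustration and not scikit-lego's actual implementation. It assumes the regressor is trained only on rows with a nonzero target, and the names clf, reg and predict are chosen just for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
# Zero-inflated target: nonzero only when the first two features are positive.
y = ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)

# Stage 1: classify whether the target is nonzero.
nonzero = y != 0
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, nonzero)

# Stage 2: fit the regressor on the nonzero rows only (assumption of this sketch).
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[nonzero], y[nonzero])


def predict(X_new):
    """Return 0 where the classifier predicts zero, else the regressor's estimate."""
    is_nonzero = clf.predict(X_new).astype(bool)
    out = np.zeros(len(X_new))
    if is_nonzero.any():
        out[is_nonzero] = reg.predict(X_new[is_nonzero])
    return out


y_pred = predict(X)
```

In this toy data, features 0 and 1 only influence the prediction through the classifier and features 2 and 3 only through the regressor, which is why the two sets of importances tell different stories.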

FBruzzesi commented 1 year ago

I know how feature importance works. My concern is how to properly define it in this case, especially because ZeroInflatedRegressor is completely agnostic to the classifier and regressor provided.

The code snippet above shows how to access those values, if available. I don't think it is in the scope of the class to define how a blend of feature importances should behave.
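
Outside the class, one model-agnostic option is permutation importance computed on the predictions of the whole model: shuffle one column at a time and measure how much the error grows. This is only one possible definition of "overall" importance, not something the library prescribes. In the sketch below, permutation_importance_sketch is a hypothetical helper, and toy_predict is a stand-in for a fitted model's predict method so that the example runs without training anything.

```python
import numpy as np


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Hypothetical helper: mean increase in MSE when one column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between column j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances


def toy_predict(X):
    """Stand-in for model.predict: the data-generating function from the thread."""
    return ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))  # column 4 is pure noise, unused by toy_predict
y = toy_predict(X)

importances = permutation_importance_sketch(toy_predict, X, y)
ranking = np.argsort(importances)[::-1]  # most important feature first
```

With a fitted ZeroInflatedRegressor, model.predict would be passed in place of toy_predict, ideally on held-out data. scikit-learn also provides sklearn.inspection.permutation_importance, which implements the same idea for fitted estimators.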

FBruzzesi commented 1 year ago

Closing this issue for now as not planned, because there is no clear vision or path for the implementation. Feel free to comment further if needed.