parrt / random-forest-importances

Code to compute permutation and drop-column importances in Python scikit-learn models
MIT License

Feature importance is zero!!! #12

Open Gunnvant opened 6 years ago

Gunnvant commented 6 years ago

I am computing permutation feature importance on a dataset, and every importance comes out zero. I have checked the results against the R implementation, where I get non-zero variable importances. What could be the reason? Here is my code:

from rfpimp import *
import numpy as np
from sklearn.base import clone
# Note: _generate_unsampled_indices is a private scikit-learn helper; in newer
# releases it lives in sklearn.ensemble._forest and takes an extra
# n_samples_bootstrap argument.
from sklearn.ensemble.forest import _generate_unsampled_indices

# TODO: add arg for subsample size to compute oob score

def oob_classifier_accuracy(rf, X_train, y_train):
    X = X_train.values
    y = y_train.values

    n_samples = len(X)
    n_classes = len(np.unique(y))
    predictions = np.zeros((n_samples, n_classes))
    for tree in rf.estimators_:
        # rows this tree never saw during bootstrap sampling
        unsampled_indices = _generate_unsampled_indices(tree.random_state, n_samples)
        tree_preds = tree.predict_proba(X[unsampled_indices, :])
        predictions[unsampled_indices] += tree_preds

    predicted_class_indexes = np.argmax(predictions, axis=1)
    predicted_classes = [rf.classes_[i] for i in predicted_class_indexes]

    oob_score = np.mean(y == predicted_classes)
    return oob_score

def permutation_importances(rf, X_train, y_train, metric):
    """
    Return importances from pre-fit rf; metric is function
    that measures accuracy or R^2 or similar. This function
    works for regressors and classifiers.
    """
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        save = X_train[col].copy()
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save
        imp.append(baseline - m)
    return np.array(imp)

rf = clone(base_rf)  # base_rf: a RandomForestClassifier configured elsewhere
rf.fit(X_train, y_train)
oob = oob_classifier_accuracy(rf, X_train, y_train)
print("oob accuracy", oob)

imp = permutation_importances(rf, X_train, y_train,
                              oob_classifier_accuracy)
imp

Gives an output of:

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

I also computed oob_classifier_accuracy() after permuting all the variables, and the reported accuracy does not change at all. The event rate in the data is rather low, around 5%.
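
A minimal sketch (with made-up numbers) of why such a low event rate can pin an accuracy-based metric in place: if the forest assigns every row a low event probability, permuting a column shifts predict_proba a little without flipping a single argmax decision, so the accuracy delta is exactly 0.

import numpy as np

rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)
y[:50] = 1                                   # ~5% event rate, as in the data above
proba = np.column_stack([np.full(1000, 0.90), np.full(1000, 0.10)])
jittered = proba + rng.uniform(-0.05, 0.05, proba.shape)  # stand-in for a permuted column

before = (proba.argmax(axis=1) == y).mean()
after = (jittered.argmax(axis=1) == y).mean()
print(before - after)  # 0.0: no argmax flips, so accuracy never moves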

parrt commented 6 years ago

What do you mean by you've checked with R? How does it compare? If the variables are highly collinear then you will see this sort of behavior, although all of them being exactly zero doesn't make a lot of sense. I suggest that you use the actual rfpimp Python package, which has improved versions of those routines; a minimal usage sketch follows.
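
For reference, a minimal sketch of the packaged routines (importances and plot_importances are the rfpimp entry points; rf is a fitted model and X_valid/y_valid whatever frame you score on):

from rfpimp import importances, plot_importances

I = importances(rf, X_valid, y_valid)  # DataFrame of permutation importances
plot_importances(I)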

Gunnvant commented 6 years ago

Hello, I am attaching the result comparison for both Python and R's default random forest feature importance (mean decrease in accuracy). As you can see in the results, the current Python implementation yields zero for every variable, while the R values are different (all rather low in magnitude, though). I can share this data for reproducibility. comparsions_feature_imp.zip

parrt commented 6 years ago

Very interesting. I'm surprised by the 0.0. I would imagine it would be very small, as R's are, but not exactly 0. Are you sure that min_samples_leaf=100 is appropriate? I've never used a value that big. Another thing to keep in mind is that if your model is not very accurate, then the feature importance is not very meaningful. What does your OOB error/score show for the model? Any chance I could get access to the data to try it myself?

Gunnvant commented 6 years ago

I am attaching the code where I've computed some accuracy metrics on the OOB data, and I am also sharing the data files. Is there a possibility of some weird round-off occurring somewhere? In this data the class prevalence is around 5%, so I have computed accuracy using 5% as the probability cutoff. data_accuracy.zip
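
For clarity, the cutoff-based accuracy described above amounts to something like this sketch (rf, X_train, y_train from the earlier code; the 0.05 threshold is the stated assumption, and the positive class is assumed to be column 1 of predict_proba):

event_proba = rf.predict_proba(X_train.values)[:, 1]
preds = (event_proba >= 0.05).astype(int)  # call it an event when P(event) >= 5%
print("cutoff accuracy", np.mean(preds == y_train.values))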

parrt commented 6 years ago

Thanks. I'll look when I can.

ThomasTeodorowicz commented 5 years ago

Hey, I am encountering a similar (the same?) thing at the moment when calculating permutation importance for some random forest features. The same result as in this issue (everything rated 0.0) occurs when I use many features (86) at once. For comparison, the Gini importance ratings are still "normal" for the same set of features. I also tested the effect of removing the highest-rated feature (according to Gini), and it indeed had a rather significant effect on the result. I don't know if this helps, but it seems related.

parrt commented 5 years ago

Weird. Is it possible there is lots and lots of collinearity? Try running the feature dependence plot; a sketch of that call is below.
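
A sketch of that call, using the dependence helpers shipped with rfpimp (feature_dependence_matrix and plot_dependence_heatmap; X_train is your feature frame):

from rfpimp import feature_dependence_matrix, plot_dependence_heatmap

D = feature_dependence_matrix(X_train)  # how well each feature is predicted from the others
plot_dependence_heatmap(D)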

ThomasTeodorowicz commented 5 years ago

Yes, there is quite a bit of collinearity according to the dependence plot, but there are also some features that are not collinear at all. I also tried using only the features that have some collinearity, and it leads to the same result (0.0). And as in the previous test, removing the highest-rated feature (according to Gini) has an impact on the result.

parrt commented 5 years ago

Are you guys using the rfpimp package or the code from the article? The package has been tested much more.

inhail commented 4 years ago

I copied the importances and plot_importances functions from this repo and tried to plot the feature importance.

My dataset comes from kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

The code is simple:

df = pd.read_csv('creditcard.csv')
X, y = df.drop('Class', axis=1), df['Class']
base_rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                 n_jobs=-1, oob_score=True)
rf2 = clone(base_rf)
rf2.fit(X, y)
print(rf2.oob_score_)  # oob: out-of-bag
I2 = importances(rf2, X, y)
plot_importances(I2)

However, I also get zero importance for every feature (see the attached plot).

parrt commented 4 years ago

What is the error metric or out-of-bag score? If it is not a good classifier, you will not get good results.

floreslg commented 3 years ago

Did you try computing permutation importance on a test dataset? If your classifier is perfect on the training set (e.g. score = 1), it is possible that permuting one variable doesn't change that "incredible" score, and so the importance could be 0 for all variables... A sketch of that check is below.
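
As a sketch of that check, using scikit-learn's built-in permutation importance on a held-out split (X, y, and rf are assumed from your own setup):

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf.fit(X_tr, y_tr)
r = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(r.importances_mean)  # nonzero only if the held-out score actually moves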

enesok commented 1 year ago

Any update regarding this problem?

parrt commented 1 year ago

Still not sure there is a problem...