Gunnvant opened this issue 6 years ago
What do you mean by "you've checked with R"? How does it compare? If the variables are highly collinear, then you will see this sort of behavior, although all of them being exactly zero doesn't make a lot of sense. I suggest that you use the actual rfpimp Python package, which has improved versions of those routines.
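For reference, the package route looks roughly like this (a minimal sketch; `rf`, `X_valid`, and `y_valid` are placeholder names for a fitted classifier and a held-out set):

```python
# A sketch of the rfpimp package usage suggested above.
from rfpimp import importances, plot_importances

# Permutation importance (mean decrease in accuracy) measured on held-out data.
imp = importances(rf, X_valid, y_valid)
plot_importances(imp)
```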
Hello, I am attaching the result comparison for both Python's and R's default Random Forest feature importance (mean decrease in accuracy). As you can see in the results, the current Python implementation yields a value of zero for all variables, while the R results are different (all the values are rather low in magnitude, though). I can share this data for reproducibility. comparsions_feature_imp.zip
Very interesting. I'm surprised by the 0.0. I would imagine it would be very small, as R's are, but not exactly 0. Are you sure that `min_samples_leaf=100` is appropriate? I've never used a value that big. Another thing to keep in mind is that if your model is not very accurate, then the feature importance is not very meaningful. What does your OOB error/score show for the model? Any chance I could get access to the data to try it myself?
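For context, the OOB score is available directly once the forest is fit with `oob_score=True`; a quick sketch with a more conventional leaf size (the value 5 here is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                            oob_score=True, n_jobs=-1)
rf.fit(X, y)
# With a ~5% event rate, an OOB accuracy near 0.95 just matches the
# majority-class base rate, i.e. the model may have learned very little.
print(rf.oob_score_)
```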
I am attaching the code where I've computed some accuracy metrics on OOB data, and I am also sharing the data files. Is there a possibility of some weird round-off occurring somewhere? In this data the class prevalence is around 5%, so I have computed accuracy assuming 5% as the cutoff. data_accuracy.zip
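A sketch of that kind of check, thresholding the OOB class probabilities at the 5% prevalence instead of the default 0.5 (assumes a forest `rf` fit with `oob_score=True`):

```python
import numpy as np

# Out-of-bag probability of the positive class for each training row.
oob_proba = rf.oob_decision_function_[:, 1]
y_pred = (oob_proba >= 0.05).astype(int)   # 5% prevalence as the cutoff
print('OOB accuracy at 5% cutoff:', np.mean(y_pred == y.values))
```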
Thanks. I'll look when I can.
Hey, I am encountering a similar (the same?) thing at the moment when calculating permutation importance for some random forest features. The same result as in this issue (everything is rated 0.0) occurs when I use many features (86) at once. For comparison, the Gini importance ratings are still "normal" for the same number of features. I also tested the effect of removing the highest-rated feature (according to Gini), and it indeed had a (rather significant) effect on the result. Dunno if this helps, but it seemed to be related.
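One way to put the two measures side by side, using scikit-learn's `permutation_importance` as an independent cross-check (a sketch; `rf`, `X_valid`, and `y_valid` are placeholders for a fitted forest and held-out data):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Gini (mean decrease in impurity) from the fitted forest...
gini = pd.Series(rf.feature_importances_, index=X_valid.columns)
# ...vs. permutation importance measured on held-out data.
result = permutation_importance(rf, X_valid, y_valid, n_repeats=5, n_jobs=-1)
perm = pd.Series(result.importances_mean, index=X_valid.columns)
print(pd.DataFrame({'gini': gini, 'permutation': perm})
        .sort_values('gini', ascending=False))
```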
Weird. Is it possible there is lots and lots of collinearity? Try running the feature dependence plot.
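A minimal sketch of that, assuming rfpimp's `feature_dependence_matrix` and `plot_dependence_heatmap` helpers:

```python
from rfpimp import feature_dependence_matrix, plot_dependence_heatmap

# Each row estimates how well one feature can be predicted from the others;
# values near 1 flag strong dependence/collinearity.
D = feature_dependence_matrix(X)
plot_dependence_heatmap(D)
```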
Yes, there is quite a bit of collinearity according to the dependence plot, but there are also some features which are not collinear at all. I also tried using the features that have some collinearity, and it leads to the same result (0.0). And as in the previous test, removing the highest-rated feature (according to Gini) has an impact on the result.
Are you guys using the rfpimp package or the code from the article? The package has been tested much more.
I copied the following functions from this repo, `importances` and `plot_importances`, and tried to plot the feature importance.
My dataset comes from Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud/data
The code is simple:
```python
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from rfpimp import importances, plot_importances

df = pd.read_csv('creditcard.csv')
X, y = df.drop('Class', axis=1), df['Class']
base_rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                 n_jobs=-1, oob_score=True)
rf2 = clone(base_rf)
rf2.fit(X, y)
print(rf2.oob_score_)  # oob: out-of-bag
I2 = importances(rf2, X, y)
plot_importances(I2)
```
However, I also get zero importance for every feature.
What is the error metric or out-of-bag score? If it is not a good classifier, you will not get good results.
Did you try to get permutation importance on a test dataset? If your classifier is very good on the train dataset (e.g. score = 1), it could be that removing one variable doesn't change its "incredible" score, and so the importance could be 0 for all variables...
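A sketch of that suggestion: hold out a test set, check that the train score isn't saturated, and measure the permutation importances on the unseen rows (names are placeholders):

```python
from sklearn.model_selection import train_test_split
from rfpimp import importances

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y)    # stratify: the positive class is rare

rf.fit(X_train, y_train)
print('train score:', rf.score(X_train, y_train))  # if this is ~1.0, permuting a
print('test score: ', rf.score(X_test, y_test))    # column may not move it at all

imp = importances(rf, X_test, y_test)  # permute-and-rescore on unseen data
print(imp)
```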
Any update regarding this problem?
Still not sure there is a problem...
I am using a dataset to compute feature importance via permutation, and every variable comes out with zero importance. I have checked the results against the R implementation, where I get non-zero variable importance. What could be the reason? Here is my code:
It gives an output of:
I also computed the `oob_classifier_accuracy()` by permuting all the variables; the accuracy reported doesn't change at all. The event rate in the data is rather low, around 5%.
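For concreteness, that check looks something like this (a sketch, assuming rfpimp's `oob_classifier_accuracy` and an already-fitted forest `rf`):

```python
import numpy as np
from rfpimp import oob_classifier_accuracy

baseline = oob_classifier_accuracy(rf, X, y)
for col in X.columns:
    saved = X[col].copy()
    X[col] = np.random.permutation(X[col])            # scramble one feature
    drop = baseline - oob_classifier_accuracy(rf, X, y)
    X[col] = saved                                    # restore the column
    print(f'{col}: drop in OOB accuracy = {drop:.4f}')
```

Note that with a ~5% event rate, plain accuracy barely moves when a feature is scrambled, so these drops can all round to zero even for a useful feature; a metric such as AUC or balanced accuracy is usually more informative on data this imbalanced.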