mlp2018 / fraud_detection

Kaggle competition: TalkingData AdTracking Fraud Detection Challenge
Apache License 2.0

investigate feature importance of lightgbm #32

Open aranas opened 6 years ago

aranas commented 6 years ago

(attached plot: feature_importance)

johannadevos commented 6 years ago

How are the numbers on the x-axis calculated?

andregalvez79 commented 6 years ago

Thank you! I would also like to know what the numbers on the x-axis mean. Perhaps then we can apply a cut-off around 50, or below 50? Does anyone know a "rule of thumb" or reference for selecting features?

johannadevos commented 6 years ago

In general you want each feature to explain variance that is not yet explained by any preceding features. Sophie mentioned on Slack that there is one feature that apparently "drives 99% of our prediction accuracy". That means that all of the other features together explain the remaining 1%. It is very likely that within this 1%, there are again one or a few features doing all the work. I don't know of any references on this, but common sense suggests we simply eliminate all of the features that don't explain any variance in the data.
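For what it's worth, a minimal sketch of that kind of pruning, assuming a trained LightGBM booster (here called `model`) and a training frame `X_train`; both names are placeholders, not the repo's actual variables:

```python
import pandas as pd

# Split-based importance: how often each feature is used in a split.
importances = pd.Series(
    model.feature_importance(importance_type='split'),
    index=model.feature_name(),
)

# Keep only the features that are used at all (importance > 0).
kept = importances[importances > 0].index.tolist()
X_reduced = X_train[kept]
```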

andregalvez79 commented 6 years ago

OK, but which variables are the ones that don't explain any variance? Are you suggesting we keep only the variable that drives 99% of the predictions?

johannadevos commented 6 years ago

No, we can also keep one or more other variables if they account for a substantial portion of the 1%. I don't know which variables we are talking about; perhaps Sophie can tell us?

aranas commented 6 years ago

Okay, so the 99% is not a real number; it was just my way of expressing that if we only keep the feature with the highest importance, we still get very good classification accuracy. You can simply run the "feature_selection" script yourself to see the numbers, but I have also posted the output below. Basically it drops the features in order of importance (removing the ones with low importance first; in the output below, n = number of features left). Normally you would expect that at some point even the less important features still increase your accuracy, so there should be a drop-off where having a simpler model also hurts predictive power (see for example the bottom output of this post: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/). In our case, however, you can leave out almost all features, which I found very surprising:

Thresh=12.000, n=22, auc: 0.94%
Thresh=24.000, n=21, auc: 0.94%
Thresh=25.000, n=20, auc: 0.94%
Thresh=35.000, n=19, auc: 0.94%
Thresh=44.000, n=18, auc: 0.94%
Thresh=57.000, n=17, auc: 0.94%
Thresh=70.000, n=16, auc: 0.94%
Thresh=78.000, n=15, auc: 0.94%
Thresh=92.000, n=14, auc: 0.94%
Thresh=106.000, n=13, auc: 0.94%
Thresh=120.000, n=12, auc: 0.94%
Thresh=124.000, n=11, auc: 0.94%
Thresh=125.000, n=10, auc: 0.94%
Thresh=125.000, n=10, auc: 0.94%
Thresh=141.000, n=8, auc: 0.94%
Thresh=147.000, n=7, auc: 0.94%
Thresh=155.000, n=6, auc: 0.94%
Thresh=160.000, n=5, auc: 0.94%
Thresh=177.000, n=4, auc: 0.93%
Thresh=186.000, n=3, auc: 0.93%
Thresh=209.000, n=2, auc: 0.94%
Thresh=253.000, n=1, auc: 0.93%
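For reference, this is a sketch of the thresholding loop along the lines of that xgboost post, not the exact "feature_selection" script; `X_train`, `X_valid`, `y_train`, `y_valid` are placeholder names for our train/validation split:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score

# Fit once on all features to get the split-count importances.
base = LGBMClassifier()
base.fit(X_train, y_train)

# Walk through the sorted importances, dropping the least important features first.
for thresh in np.unique(base.feature_importances_):
    selector = SelectFromModel(base, threshold=thresh, prefit=True)
    X_tr, X_va = selector.transform(X_train), selector.transform(X_valid)
    model = LGBMClassifier().fit(X_tr, y_train)
    auc = roc_auc_score(y_valid, model.predict_proba(X_va)[:, 1])
    print(f"Thresh={thresh:.3f}, n={X_tr.shape[1]}, auc: {auc:.2f}")
```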

With respect to the x-axis values, this is what the lightgbm documentation says:

feature_importance(importance_type='split', iteration=-1)
Parameters: importance_type (string, optional (default="split")) – How the importance is calculated. If "split", result contains numbers of times the feature is used in a model. If "gain", result contains total gains of splits which use the feature.
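So the numbers on the x-axis are split counts (the default), not gains. A minimal sketch of pulling both types from a trained booster; `bst` is a placeholder name for the booster behind the plot above:

```python
import lightgbm as lgb

# Number of times each feature is used in a split vs. total gain of those splits.
split_imp = bst.feature_importance(importance_type='split')
gain_imp = bst.feature_importance(importance_type='gain')

# Optional: plot gain-based importance instead of the default split counts.
lgb.plot_importance(bst, importance_type='gain', max_num_features=25)
```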

I do get warnings about the categorical features when I run the script, so I am still wondering whether there might be something wrong with it.
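If the warnings are about how the categorical columns are passed, it might help to declare them explicitly on the Dataset. A rough sketch, assuming the raw competition ID columns and placeholder frames `X_train`/`y_train` (not necessarily what the script currently does):

```python
import lightgbm as lgb

# Columns from the TalkingData data that are really categorical IDs.
cat_cols = ['ip', 'app', 'device', 'os', 'channel']

train_set = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols)
bst = lgb.train({'objective': 'binary', 'metric': 'auc'}, train_set, num_boost_round=100)
```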

aranas commented 6 years ago

FYI, this is the output when I leave out the confidence features:

Thresh=43.000, n=11, auc: 0.93%
Thresh=47.000, n=10, auc: 0.94%
Thresh=80.000, n=9, auc: 0.93%
Thresh=94.000, n=8, auc: 0.93%
Thresh=107.000, n=7, auc: 0.94%
Thresh=129.000, n=6, auc: 0.92%
Thresh=134.000, n=5, auc: 0.93%
Thresh=153.000, n=4, auc: 0.94%
Thresh=154.000, n=3, auc: 0.92%
Thresh=165.000, n=2, auc: 0.92%
Thresh=233.000, n=1, auc: 0.91%

(attached plot: feature_importance_noconf)