ryanbressler / CloudForest

Ensembles of decision trees in go/golang.

Importance Overhaul: What Method(s) to Get P-Values? #36

Open ryanbressler opened 10 years ago

ryanbressler commented 10 years ago

P-values for variable importance are desirable because they are easier to interpret and will potentially be easier to drop into our other tools.

A couple of different methods seem viable for this. The ACE method, as used in rf-ace, involves repeatedly growing a forest that includes artificial contrasts of all features and applying Wilcoxon tests.
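For concreteness, a minimal Go sketch of the contrast-generation step, assuming features are stored one slice per feature; the function name and layout are illustrative, not CloudForest's or rf-ace's actual API:

```go
package ace

import "math/rand"

// addContrasts appends a shuffled copy of every feature to the matrix
// (one slice per feature, cases as elements). The copies preserve each
// feature's marginal distribution but break any association with the
// target, giving each real feature a matched null to be tested against.
func addContrasts(features [][]float64, rng *rand.Rand) [][]float64 {
	out := make([][]float64, 0, 2*len(features))
	out = append(out, features...)
	for _, f := range features {
		contrast := append([]float64(nil), f...)
		rng.Shuffle(len(contrast), func(i, j int) {
			contrast[i], contrast[j] = contrast[j], contrast[i]
		})
		out = append(out, contrast)
	}
	return out
}
```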

Another method, presented in "Identification of Statistically Significant Features from Random Forests", tests the change produced by permuting each feature and testing on OOB cases after each tree. This is potentially more computationally efficient since only one forest needs to be grown.
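A rough sketch of the per-tree OOB permutation step; the `Tree` interface and `Predict` signature here are stand-ins rather than CloudForest's types, and the paper's actual test is a chi-squared comparison of the permuted vs. unpermuted predictions, for which paired counts like these are the input:

```go
package permimp

import "math/rand"

// Tree is a stand-in for a fitted decision tree; Predict returns the
// predicted class index for one case.
type Tree interface {
	Predict(row []float64) int
}

// oobDelta returns, for a single tree, the number of OOB cases predicted
// correctly before and after permuting feature fi across the OOB cases.
func oobDelta(t Tree, oob [][]float64, labels []int, fi int, rng *rand.Rand) (orig, perm int) {
	// Score the intact OOB cases.
	for i, row := range oob {
		if t.Predict(row) == labels[i] {
			orig++
		}
	}
	// Permute feature fi across the OOB cases and re-score.
	idx := rng.Perm(len(oob))
	for i, row := range oob {
		shuffled := append([]float64(nil), row...)
		shuffled[fi] = oob[idx[i]][fi]
		if t.Predict(shuffled) == labels[i] {
			perm++
		}
	}
	return orig, perm
}
```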

Another interesting paper that might be complementary is "Understanding variable importances in forests of randomized trees", which presents work on totally randomized trees, Extra-Trees, and random forests, suggesting that the more randomized implementations might be useful when we are concerned primarily with feature selection.

ryanbressler commented 10 years ago

An issue with the second approach is that it relies on a chi-squared test of the tree predictions for the permuted and unpermuted cases, so you are still doing significance testing of the feature values and may have issues with numerical features. The ACE paper, on the other hand, tests on the rank of feature importance, which may be less problematic.
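If rank-based testing is the way to go, a Wilcoxon rank-sum (Mann-Whitney U) statistic over the two importance samples is the usual starting point. A minimal sketch; splitting the samples into feature vs. contrast groups is my illustrative assumption:

```go
package ace

import "sort"

// rankSumU computes the Mann-Whitney U statistic for sample a against
// sample b (e.g. importance values of a feature vs. its contrasts).
// Tied values receive average ranks.
func rankSumU(a, b []float64) float64 {
	type obs struct {
		v     float64
		fromA bool
	}
	all := make([]obs, 0, len(a)+len(b))
	for _, v := range a {
		all = append(all, obs{v, true})
	}
	for _, v := range b {
		all = append(all, obs{v, false})
	}
	sort.Slice(all, func(i, j int) bool { return all[i].v < all[j].v })

	// Assign 1-based ranks, averaging over runs of ties.
	ranks := make([]float64, len(all))
	for i := 0; i < len(all); {
		j := i
		for j < len(all) && all[j].v == all[i].v {
			j++
		}
		avg := float64(i+j+1) / 2 // mean of ranks i+1 .. j
		for k := i; k < j; k++ {
			ranks[k] = avg
		}
		i = j
	}

	var rA float64
	for i, o := range all {
		if o.fromA {
			rA += ranks[i]
		}
	}
	nA := float64(len(a))
	return rA - nA*(nA+1)/2
}
```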

ryanbressler commented 10 years ago

The other ACE paper, "Feature Selection with Ensembles, Artificial Variables and Redundancy Elimination", uses a different method: each forest is grown independently of the previous ones, and a Student's t-test is used to compare the variable importance of each variable against its contrasts.
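A minimal sketch of that comparison, assuming one importance value per forest for both the feature and its contrast; reading the test as paired across forests is my assumption, and converting the t statistic to a p-value via the t distribution is omitted:

```go
package ace

import "math"

// pairedT computes a paired t statistic over per-forest differences
// between a feature's importance and its contrast's importance
// (one pair per independently grown forest).
func pairedT(feature, contrast []float64) float64 {
	n := float64(len(feature))
	diffs := make([]float64, len(feature))
	for i := range feature {
		diffs[i] = feature[i] - contrast[i]
	}
	mean, variance := meanVar(diffs)
	return mean / math.Sqrt(variance/n)
}

// meanVar returns the sample mean and (n-1)-denominator sample variance.
func meanVar(xs []float64) (mean, variance float64) {
	n := float64(len(xs))
	for _, x := range xs {
		mean += x
	}
	mean /= n
	for _, x := range xs {
		d := x - mean
		variance += d * d
	}
	variance /= n - 1
	return
}
```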

tungntdhtl commented 10 years ago

Hi Ryan,

ryanbressler commented 10 years ago

P values aren't calculated yet. Just variable importance as described in the readme:

https://github.com/ryanbressler/CloudForest#importance-and-contrasts

This is a measure of how important each variable is to the predictor and not something that can be fed into the predictor.

tungntdhtl commented 10 years ago

My data set is like the case described here: https://github.com/ryanbressler/CloudForest#data-with-lots-of-noisy-uninformative-high-cardinality-features. I have a weight file based on p-values; what command should I use to grow trees in CloudForest with this weight file?

You wrote: "The -vet option penalizes the impurity decrease of potential best splits by subtracting the best split they can make after the target values of the cases on which the split is being evaluated have been shuffled". Assume my weight file, named "wfile.tsv", has 3 columns (featurename, p-value, importance).

Is this command correct? ~/cloudRF/growforest -train usps -rfpred usps.sf -target 0 -nTrees 500 -vet wfile.tsv

ryanbressler commented 10 years ago

No, -vet is an internal method and takes no parameters.

I'm not aware of a way to specify feature weights going into a random forest. Since it does its own feature selection internally, I'm not even sure how the algorithm could be modified to use them.

Methods like -evaloob, -vet, and comparison to artificial contrasts improve the internal feature selection.

If you already have values you want to use for feature selection from RF or another method, you'll need to apply a cutoff and/or take the top N features. You can specify whitelists and blacklists of features to use or exclude if you don't want to reproduce the data set with the smaller set of features.
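As a concrete example of applying a cutoff, a small Go program could read the three-column wfile.tsv described above and write out the features whose p-value exceeds a threshold, one name per line; the 0.05 cutoff and file names are arbitrary examples:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Reads a TSV of (featurename, p-value, importance) rows and writes every
// feature with p >= 0.05 to blacklist.txt, one feature name per line.
func main() {
	in, err := os.Open("wfile.tsv")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create("blacklist.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		cols := strings.Split(scanner.Text(), "\t")
		if len(cols) < 2 {
			continue
		}
		p, err := strconv.ParseFloat(cols[1], 64)
		if err != nil {
			continue // skip header or malformed rows
		}
		if p >= 0.05 {
			fmt.Fprintln(out, cols[0])
		}
	}
}
```

The resulting file could then be passed to growforest via its blacklist option in place of the -vet argument above, e.g. growforest -train usps -rfpred usps.sf -target 0 -nTrees 500 -blacklist blacklist.txt (check growforest -h for the exact flag your build supports).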

Ryan


ryanbressler commented 10 years ago

This is another paper that uses iterative feature selection:

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002956

It depends on pairwise correlation and network partitioning, and each forest/iteration reweights network modules and features.