imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

importance = "permutation", what is this doing? #237

Closed angelaparodymerino closed 6 years ago

angelaparodymerino commented 7 years ago

Hi,

I am quite new to running RF and to using ranger. I have ~3000 SNPs, 265 individuals, and a quantitative phenotype. My aim is to use RF (ranger) to rank the SNPs by how much each one explains the phenotype variation in the population (inter-individual variation).

Anyway, I have read that "The most reliable measurement is the (unscaled) permutation importance" (Szymczak et al., 2016). This refers to the most reliable measurement for assessing the predictive power of each variable (or SNP), which is what I am after. Therefore, I have to use importance = "permutation", which is the only way to compute permutation importance (right?). But I don't understand what exactly is being permuted. I suppose the samples (??). Could someone explain in a bit more detail how permutation in Random Forests works? I really would like to understand!

Thanks in advance for the help,

Regards,

Angela Parody-Merino

mnwright commented 6 years ago

From our paper Do little interactions get lost in dark random forests?

The basic idea of the permutation variable importance approach [18] is to consider a variable important if it has a positive effect on the prediction performance. To evaluate this, a tree is grown in the first step, and the prediction accuracy in the OOB observations is calculated. In the second step, any association between the variable of interest x_i and the outcome is broken by permuting the values of all individuals for x_i, and the prediction accuracy is computed again. The difference between the two accuracy values is the permutation importance for x_i from a single tree. The average of all tree importance values in a random forest then gives the random forest permutation importance of this variable. The procedure is repeated for all variables of interest. The package ranger [24] was used in our analyses.
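
In ranger, this procedure is what importance = "permutation" computes. A minimal sketch for a quantitative phenotype, assuming a hypothetical data.frame dat with outcome column pheno and SNP predictors:

```r
library(ranger)

# Permutation importance, left unscaled as recommended by
# Szymczak et al. (2016); `dat` and `pheno` are hypothetical names
rf <- ranger(pheno ~ ., data = dat,
             importance = "permutation",
             scale.permutation.importance = FALSE)

# SNPs ranked by permutation importance (for regression,
# the OOB error measure is the MSE)
head(sort(importance(rf), decreasing = TRUE))
```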

philiporlando commented 5 years ago

I'm getting a little hung up on the nomenclature used in this CV post. Is the permutation importance method different from a "conditional Random Forest"?

mnwright commented 5 years ago

Yes, in a conditional inference forest, a permutation test is used for splitting, while the permutation importance uses permutations to calculate variable importance.

philiporlando commented 5 years ago

Thank you for the clarification. Is it possible to incorporate this splitting rule within ranger::train()? So far, I've only found the partykit::cforest() function for performing conditional inference forests, but my workflow is dependent on the efficiency of ranger given the scale of my data.

mnwright commented 5 years ago

If you have a regression, survival or binary classification problem, you could use splitrule = "maxstat" for splitting with maximally selected rank statistics, which is an approximation of conditional inference splitting; see our paper Unbiased split variable selection for random survival forests using maximally selected rank statistics for details. Ask me for the PDF if you don't have access.
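
For illustration, a minimal sketch of maxstat splitting on a regression problem, with alpha and minprop shown at their ranger defaults (y and dat are hypothetical names):

```r
library(ranger)

rf_maxstat <- ranger(y ~ ., data = dat,
                     splitrule = "maxstat",
                     alpha = 0.5,    # significance threshold for splitting (ranger default)
                     minprop = 0.1)  # quantile restriction on candidate split points (ranger default)
```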

philiporlando commented 5 years ago

Thanks for the info! I'm able to access your paper as well.

DeFilippis commented 5 years ago

@mnwright

> Strobl et al. [68] show that the default option (in ranger) of sampling with replacement leads to a Variable Importance Measure bias in favor of predictors with many categories even if the trees are built using an unbiased criterion.

Two questions:

  1. Does this bias persist even when using the "maxstat" split rule, and thus approximating conditional inference forests which do not suffer from the bias of selecting predictors with many unique values?

  2. Do you have any guidance on whether to sample with or without replacement when using the "maxstat" split rule? I'm looking to cite evidence in favor of the default method in most rf implementations (with replacement), but have been unable to find anything. In the party package, conditional inference forests use a without-replacement sample of size 0.632 * n. Do you think this is a good default value to use for maxstat trees?

Citation: C. Strobl, A. L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25, 2007.

mnwright commented 5 years ago

  1. Yes, this "bootstrapping bias" is also apparent for "maxstat" splitting. See Fig. 4 in our Stat Med paper*.

  2. If prediction accuracy is your major concern, there is not much of a difference in my experience. See also Figs. 4 and 5 of our Stat Med paper (linked above). If any kind of model interpretation is your goal, subsampling might be the better choice (to avoid this kind of bias).

The choice of 0.632 * n (i.e., (1 - exp(-1)) * n) is quite common because this is the expected number of unique observations when sampling n times with replacement. It is a reasonable default but might be improved by tuning. See also Section 2.1.2 of our recent paper on hyperparameters** and Probst et al.'s paper on tunability.

*Free preprint: https://arxiv.org/pdf/1605.03391
**Free preprint: https://arxiv.org/pdf/1804.03515
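
For reference, a minimal sketch of the two sampling schemes discussed above (y and dat are hypothetical names):

```r
library(ranger)

# Default: bootstrap sampling with replacement
rf_boot <- ranger(y ~ ., data = dat, replace = TRUE)

# Subsampling without replacement; when replace = FALSE, ranger's own
# default for sample.fraction is already 0.632
rf_sub <- ranger(y ~ ., data = dat,
                 replace = FALSE,
                 sample.fraction = 0.632)
```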

djvill commented 4 years ago

> If you have a regression, survival or binary classification problem, you could use splitrule = "maxstat" for splitting with maximally selected rank statistics, which is an approximation of conditional inference splitting; see our paper Unbiased split variable selection for random survival forests using maximally selected rank statistics for details. Ask me for the PDF if you don't have access.

In ranger 0.11.2, I'm getting "Error: maxstat splitrule applicable to regression or survival data only" when I try to use splitrule = "maxstat" for binary classification. Am I missing something?

mnwright commented 4 years ago

You'll have to grow a regression forest on 0/1 outcomes. For two classes, this is equivalent to classification with probability = TRUE. The maxstat split rule works in this way, but it's not well tested.
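
A minimal sketch of that workaround, assuming a data.frame dat with a binary factor column status (hypothetical names):

```r
library(ranger)

# Recode the binary factor to 0/1 and grow a regression forest
dat01 <- dat
dat01$status <- as.numeric(dat01$status == levels(dat01$status)[2])

# maxstat splitting now works; predictions fall in [0, 1],
# comparable to a probability forest (probability = TRUE)
rf_reg <- ranger(status ~ ., data = dat01, splitrule = "maxstat")
```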

martynakgajos commented 2 years ago

A follow-up question: I was trying to find the answer in the code, but I got lost between the trees. How is permutation feature importance calculated in the case of a regression task? The decrease of which measure is used instead of accuracy (by default)?

mnwright commented 2 years ago

It's the mean squared error.

mikoontz commented 2 years ago

I found my way to this issue while trying to build a fast, unbiased random forest on a binary outcome hoping, eventually, to do some model inference using a variable importance metric that is robust to correlations among predictors. (For posterity, the Sobol-MDA approach to deal with correlated predictors led me to this other {ranger} issue, which led me to "conditional predictive impact" and the {cpi} package). I learned a lot! Thanks for all the links and suggestions.

In trying to implement this, I am hoping to account for class imbalance too. When I switch from a classification problem of the binary outcome to a regression problem with a binary outcome to effectively grow a probability-like random forest, I lose the ability to include class.weights in the model building. I've seen some implementations where the case.weights argument is used to stand in for class.weights by assigning each observation a weight corresponding to the inverse of its frequency (1 / table(data$target)). That does seem to improve model performance, but some trial-and-error leads me to believe that this might be doing something that I don't intend (see https://github.com/BlasBenito/spatialRF/issues/12).

Is there a way that you would recommend accounting for class imbalance in a 0/1 outcome regression problem? Particularly a way that can be implemented in {mlr3} so that I can use {cpi}?

mnwright commented 2 years ago

I think you are looking for probability trees/forests. In ranger, define it as a classification outcome (i.e. a factor) and set probability = TRUE in ranger(). In mlr3, just use a classification task and set predict_type = "prob" in the lrn() call (there is an example in the cpi package).
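
A minimal sketch of both routes, assuming a data.frame dat with factor target status (hypothetical names):

```r
library(ranger)
library(mlr3)
library(mlr3learners)

# Probability forest directly in ranger
rf_prob <- ranger(status ~ ., data = dat, probability = TRUE)

# The equivalent setup in mlr3
task    <- as_task_classif(dat, target = "status")
learner <- lrn("classif.ranger", predict_type = "prob")
```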

mikoontz commented 2 years ago

Thanks so much for your reply! You are right that I am essentially looking for probability trees, but I learned from this issue that I can only use the splitrule = "maxstat" approach on regression problems, and also that I could use regression trees on a 0/1 outcome to mimic a probability tree.

> You'll have to grow a regression forest on 0/1 outcomes. For two classes, this is equivalent to classification with probability = TRUE. The maxstat split rule works in this way, but it's not well tested.

The ultimate goal is to try to get unbiased variable selection, since I'm interested in the inference about important features, which is what led me to conditional inference trees, and then to your Wright et al., 2017 paper.

So is there an appropriate way to mimic class.weights from classification/probability trees if I'm constrained to building a regression random forest (so I can use the splitrule = "maxstat")?

For posterity, I did learn that I can implement the case.weights approach in {mlr3} by using the set_col_roles() method on the task. First, I add a column to my data.frame that represents the weight for each row, then set the role for that column to be "weight": task$set_col_roles(cols = "wt", roles = "weight")
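
Spelled out, a minimal sketch of that approach with inverse class-frequency weights (dat and status are hypothetical names):

```r
library(mlr3)

# One weight per row: the inverse of that row's class frequency
dat$wt <- as.numeric(1 / table(dat$status)[as.character(dat$status)])

task <- as_task_classif(dat, target = "status")
task$set_col_roles(cols = "wt", roles = "weight")  # use `wt` as observation weights
```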

mnwright commented 2 years ago

No, class.weights and maxstat splitting won't work together, because the class weights directly change the Gini index splitting rule; see e.g. here: https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeProbability.cpp#L290.

However, you can use case.weights because those only change the bootstrapping/subsampling of observations for the trees.
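
For example, a minimal sketch combining inverse class-frequency case weights with maxstat splitting on a 0/1 outcome (y and dat are hypothetical names):

```r
library(ranger)

# case.weights only bias which rows each tree samples;
# the maxstat splitting rule itself is left untouched
w <- as.numeric(1 / table(dat$y)[as.character(dat$y)])

rf <- ranger(y ~ ., data = dat,
             splitrule = "maxstat",
             case.weights = w)
```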

And again, be careful with splitrule = "maxstat" on 0/1 data. Doing rank statistics with only ties sounds wrong... If you (or anyone else) wants to do some testing, please share!

mikoontz commented 2 years ago

This is so helpful; thank you very much for your time!

And thanks also for your caveat about splitrule = "maxstat" on 0/1 data. I can likely provide some testing, though I think the supplemental material of Nembrini 2018 might do a better job than whatever I might have done!

You've likely seen that work, but for posterity: they performed simulations to identify bias in variable importance metrics that arise from bias in the splitting rule. They use 0/1 outcomes to investigate this in the classification scenario, and use splitrule = "maxstat" with alpha = 1 and minprop = 0 in a {ranger} regression to perform the "classification".
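
In {ranger} terms, that configuration would look something like this (y01 and dat are hypothetical names):

```r
library(ranger)

# Nembrini (2018)-style setup: maxstat regression on a 0/1 outcome
rf <- ranger(y01 ~ ., data = dat,
             splitrule = "maxstat",
             alpha = 1,    # no significance threshold
             minprop = 0)  # no quantile restriction on split points
```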

From my read, the results for the 0/1 outcome {ranger} regression using splitrule = "maxstat" are very encouraging! The approach that mimics classification (i.e., a "maxstat" split rule with regression on a 0/1 outcome) shows behavior similar to regression on a continuous target in terms of avoiding variable selection bias trickling down into the variable importance metrics.

Is there different testing that would make us feel more confident of this?

mnwright commented 2 years ago

Yes, that looks good! I would also check the prediction performance, but I'm sure you are already doing that. @snembrini maybe this is interesting for you?

snembrini commented 2 years ago

Thanks @mnwright for looping me in.

We tried maxstat with binary outcomes only for completeness; we never used it for prediction, for instance.

Now, I would like to clarify and demystify a couple of things.

mikoontz commented 2 years ago

This is great; thanks so much for providing some of this intuition here.

I'm convinced about using the computational power of {ranger} with a maxstat split rule.

I think the only outstanding question in my mind is about dealing with correlated predictors. That seems to be a factor that also plays into which variable importance measure you might pick. In a different issue, it was you who pointed me to CPI for exactly this use case.

Understanding that there is no such thing as totally "unbiased" variable selection, it sounds like more research is needed before using the impurity_corrected importance metric (implemented in {ranger} and detailed in your 2018 paper) with correlated predictors? But my understanding is also that a Gini impurity-based feature importance is likely to be sensitive to correlated predictors.
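
For completeness, a minimal sketch of the corrected impurity importance in {ranger}, with p-values via importance_pvalues() (y and dat are hypothetical names):

```r
library(ranger)

rf <- ranger(y ~ ., data = dat, importance = "impurity_corrected")

# p-values for the corrected importance; "janitza" reuses the fitted
# forest, "altmann" is the permutation-based alternative
importance_pvalues(rf, method = "janitza")
```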

snembrini commented 2 years ago

Unbiasedness is unachievable for recursive partitioning when you have predictors of different types.

Correlation is problematic in general within any statistical framework: CPI would give a higher ranking to uncorrelated predictors, which I think makes sense for most problems, but variable importance is a vague concept and it really depends on what you want to get from the data.

Correlated variables compete and share some of the same information, so for prediction I'd rather pick predictors that are not correlated, but that's a personal choice.

The total reduction in heterogeneity is fine because it measures exactly that; the problem is when you use it on a forest that splits to purity. Splitting criteria, variable importance, and prediction are 3 different beasts with different goals.

"Unbiased" in this case is a misnomer used in machine learning because there is no parameter.

The variable importance we presented in 2018 has the purpose of mimicking the permutation importance, which can be extremely slow for some datasets (even days), and it can be computed at basically no cost when the forest is generated.

Hope this helps