imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Extratrees implementation #461

Closed talegari closed 4 years ago

talegari commented 4 years ago

Hi Marvin,

Ranger's implementation of extratrees does not seem to grow trees down to a single observation per terminal node under the following settings:

splitrule = "extratrees"
mtry = 1
min.node.size = 1

It looks like, if a node selects a variable whose values are all identical (and hence cannot be split on), tree growth stops at that node. Is this by design?

A quick comparison with sklearn's implementation showed this (from the max_features documentation): "Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features." I confirmed it by running an example (extratrees_ranger_sklearn.pdf) across both ranger and sklearn.
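For reference, a minimal R sketch along these lines (illustrative data and settings, not the exact example from the attached pdf): it fits ranger with the settings above on the full sample and reports the size of the largest terminal node in each tree; if every observation ended up alone in a leaf, the result would be 1 for all trees.

```r
library(ranger)

set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))

fit <- ranger(
  y ~ ., data = d,
  num.trees       = 10,
  splitrule       = "extratrees",
  mtry            = 1,
  min.node.size   = 1,
  replace         = FALSE,     # use the full sample in every tree so
  sample.fraction = 1          # terminal node sizes are directly comparable
)

# terminal node id of every observation in every tree
nodes <- predict(fit, d, type = "terminalNodes")$predictions

# size of the largest terminal node per tree; all 1s would mean full isolation
apply(nodes, 2, function(z) max(table(z)))
```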

This issue percolates down to my package solitude, where I use ranger to grow isolation forests.

mnwright commented 4 years ago

You are right, we don't try further variables if no good split is found in the mtry variables. I'm pretty sure that is also how it's implemented in most random forest implementations (e.g. Breiman's Fortran code).
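Schematically, the behaviour is something like this (an R sketch for illustration only; the actual splitting code is C++ and differs in detail):

```r
# Illustrative sketch only, not ranger internals: draw mtry candidates from *all*
# variables; if none of the drawn candidates admits a valid split (e.g. the variable
# is constant in the node), the node becomes terminal.
split_or_stop <- function(node_data, mtry, response = "y") {
  candidates <- sample(setdiff(names(node_data), response), mtry)
  usable <- candidates[vapply(node_data[candidates],
                              function(v) length(unique(v)) > 1,
                              logical(1))]
  if (length(usable) == 0) {
    return(NULL)   # no valid split among the mtry candidates -> terminal node
  }
  usable[1]        # otherwise split on one of the sampled candidates
}
```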

talegari commented 4 years ago

@mnwright Extremely randomized trees (ERT) differ from random forests when it comes to selecting variables for splitting.

Here is the pseudo-code for extratrees from the Geurts et al. paper that introduced ERTs (Table 6, pg 28):

## Build an extra tree ensemble(S)
Input: a training set S.
Output: a tree ensemble T = {t1, ..., tM}.

    – For i = 1 to M
        - Generate a tree: ti = Build an extra tree(S);
    – Return T.

## Build an extra tree(S)
Input: a training set S.
Output: a tree t.

    – Return a leaf labeled by class frequencies (or average output, in regression) in S if
        (i) |S| < nmin, or
        (ii) all candidate attributes are constant in S, or
        (iii) the output variable is constant in S.
    – Otherwise:
        1. Select randomly K attributes, {a1, ..., aK}, without replacement, among all (non-constant in S) candidate attributes;
        2. Generate K splits {s1, ..., sK}, where si = Pick a random split(S, ai), for all i = 1, ..., K;
        3. Select a split s∗ such that Score(s∗, S) = max_{i=1,...,K} Score(si, S);
        4. Split S into subsets Sl and Sr according to the test s∗;
        5. Build tl = Build an extra tree(Sl) and tr = Build an extra tree(Sr) from these subsets;
        6. Create a node with the split s∗, attach tl and tr as left and right subtrees of this node and return the resulting tree t.

## Pick a random split(S, a)
Input: a training set S and an attribute a.
Output: a split.

    – If the attribute a is numerical:
        - Compute the maximal and minimal value of a in S, denoted respectively by a^S_min and a^S_max;
        - Draw a cut-point ac uniformly in [a^S_min, a^S_max];
        - Return the split [a < ac].
    – If the attribute a is categorical (denote by A its set of possible values):
        - Compute A_S, the subset of A of values of a that appear in S;
        - Randomly draw a proper non-empty subset A1 of A_S and a subset A2 of A \ A_S;
        - Return the split [a ∈ A1 ∪ A2].

Notice that the first item under 'Otherwise' in 'Build an extra tree(S)' says: Select randomly K attributes, {a1, ..., aK}, without replacement, among all (non-constant in S) candidate attributes. This means that the mtry variables should be chosen among the variables that have at least two unique values in the node. In particular, for mtry = 1, tree growth at a node stops only if every variable is constant (has a single unique value) there.
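For contrast with the behaviour above, a sketch of this selection step in R (illustrative names only, not ranger code): candidates are drawn only among attributes that are non-constant in the node, so with mtry = 1 a node becomes a leaf only when every attribute is constant.

```r
# Sketch of the ERT selection step quoted above (illustrative only, not ranger code).
pick_ert_splits <- function(node_data, mtry, response = "y") {
  vars <- setdiff(names(node_data), response)
  non_constant <- vars[vapply(node_data[vars],
                              function(v) length(unique(v)) > 1,
                              logical(1))]
  if (length(non_constant) == 0) {
    return(NULL)                                   # every attribute constant -> leaf
  }
  candidates <- sample(non_constant, min(mtry, length(non_constant)))

  # Pick a random split for a numerical attribute: cut-point drawn uniformly
  # between the attribute's min and max within the node
  splits <- lapply(candidates, function(a) {
    x <- node_data[[a]]
    list(attribute = a, cutpoint = runif(1, min(x), max(x)))
  })
  splits   # in the full algorithm, the best of these K splits (by Score) is kept
}
```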

mnwright commented 4 years ago

Yes, there are some differences. We should make clear in the docs that we implement a randomized splitting rule as in ERT, but that selecting splitrule = "extratrees" doesn't change anything else in the RF algorithm; see #408.