suiji / Arborist

Scalable decision tree training and inference.

Support for maxnodes #23

Closed ck37 closed 7 years ago

ck37 commented 7 years ago

Hello,

It would be helpful to support the maxnodes argument as implemented in randomForest - is that a possibility? It can be a better way to limit tree complexity than a minimum node size (minnodesize), because it allows early splits to produce small nodes as long as the maximum node count has not been exceeded. Segal & Xiao (2011) discuss it:

"The R package RF includes a tuning parameter maxnodes that can be used to override growing maximal trees. Judiciously setting this parameter is anticipated to be beneficial in large sample size settings, and/or in situations where maximal trees severely overfit. Such control seems less awkward, especially for large sample sizes, than trying to govern tree size by the (related) parameter nodesize which determines the minimum size for terminal nodes: prescribing maxnodes more readily allows for penalization of tree complexity whereas not precluding variably sized nodes. Lin and Jeon provide a theoretic perspective that is also contrary to use of maximal trees."

Segal, M., & Xiao, Y. (2011). Multivariate random forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 80-87.
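For concreteness, here is how the two controls differ in randomForest (the data set and parameter values are arbitrary, chosen only to illustrate):

```r
library(randomForest)

data(iris)

## Cap tree complexity by terminal-node count: growth stops once a tree
## reaches 16 terminal nodes, but individual nodes may still be small.
rf_maxnodes <- randomForest(Species ~ ., data = iris, maxnodes = 16)

## Cap complexity by minimum node size instead: nodes are not split
## further once they shrink to 10 observations, which also rules out
## small splits early in the tree.
rf_nodesize <- randomForest(Species ~ ., data = iris, nodesize = 10)
```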

Thanks, Chris

suiji commented 7 years ago

Yes, we should be able to support the functionality of the 'maxnodes' option. According to Andy Liaw's documentation, the option places an upper limit on the number of terminals. Thank you, and please feel free to suggest other enhancements.

It should be possible to get this into the upcoming CRAN release, although Tal's request from several months ago takes precedence.

ck37 commented 7 years ago

Thanks, that would be great - appreciate it.

suiji commented 7 years ago

The Arborist currently splits all nodes of a single tree level in a sort of mini-batch, rather than one node at a time. At the end of a splitting pass, then, the number of new nodes could exceed the upper limit by a large number. When this happens, I propose to order the new nodes by information content - i.e., the same information criterion used to split - and discard as many as necessary, beginning with the least informative split. Please feel free to suggest other rejection schemes.
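A minimal sketch of that rejection rule, with made-up names that do not reflect the Arborist internals: `frontier` stands in for the splits proposed by the current level's pass, with an `info` column holding each split's information value.

```r
## Sketch only: each accepted split raises the terminal count by one,
## since two children replace one parent.
prune_level <- function(frontier, n_terminals, max_nodes) {
  overshoot <- (n_terminals + nrow(frontier)) - max_nodes
  if (overshoot <= 0)
    return(frontier)  # still under the cap: keep every proposed split

  ## Rank splits by the same criterion used to make them and discard
  ## the least informative until the cap is respected.
  n_keep <- max(0L, nrow(frontier) - overshoot)
  ranked <- order(frontier$info, decreasing = TRUE)
  frontier[ranked[seq_len(n_keep)], , drop = FALSE]
}
```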

Note that this approach will not work when we support independent training of separate subtrees, a feature we envision adding in the near future. At that time, we should probably employ a "detraining" approach on the fully-trained tree, in which splits are successively merged until the resulting tree is sufficiently small.

ck37 commented 7 years ago

That seems reasonable to me, although I'm wondering if there is a risk of overfitting that way.

The other way that comes to mind would be to reconstruct the order in which the nodes would have been created, and eliminate them in reverse order of construction. That ordering seems more arbitrary, but it might not overfit as much as the first option.

suiji commented 7 years ago

Since this is a global property of the tree, solutions that rely on early termination will vary with the implementation. To avoid falling into that trap, detraining might be the better option. We could use an information metric or random sampling - either approach would be reproducible.

suiji commented 7 years ago

Say that 'n' terminals have been created, where 'n' is larger than the specified value of 'maxNodes'. So the task is to remove 'm = n - maxNodes' terminals from the tree. A couple of treatments come to mind:

i) Select 'm' pairs of leaves to unsplit from the frontier, a priori.

ii) Select 'm' pairs, one at a time, from each successively-modified frontier.

Option i) is easier to implement and probably a faster operation to perform, but ii) offers more opportunities to alter the shape of the trained tree.
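To make ii) concrete, a small sketch over an assumed flat, binary-tree encoding (the `detrain` helper and its column names are invented for illustration, not the Arborist's actual structures): `nodes` has one row per node, with columns `id`, `parent` (NA for the root), `info` (the split's information value, NA for leaves) and `leaf`.

```r
detrain <- function(nodes, max_leaf) {
  repeat {
    if (sum(nodes$leaf) <= max_leaf) break

    ## A parent is mergeable when both of its children are leaves; the
    ## set of mergeable parents is the current frontier.
    leaf_kids <- table(nodes$parent[nodes$leaf])
    mergeable <- as.integer(names(leaf_kids)[leaf_kids == 2])

    ## Unsplit the least informative mergeable parent, removing its two
    ## leaves; the frontier is recomputed on the next pass.
    victim <- mergeable[which.min(nodes$info[match(mergeable, nodes$id)])]
    nodes  <- nodes[!(nodes$parent %in% victim), , drop = FALSE]
    nodes$leaf[nodes$id == victim] <- TRUE
  }
  nodes
}
```

Each pass removes one terminal, so 'm = n - maxNodes' passes are needed; option i) would instead pick all 'm' victims from the initial frontier in a single pass.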

Do you have any deeper insights?

ck37 commented 7 years ago

I don't have any deeper insights personally. I would say going with whatever the randomForest package does would be a good first approach, if that's possible to determine.

Otherwise I personally would order the parent nodes by the complexity or split criterion used to make the split, then remove the first m pairs of child nodes based on that ordering. I can't claim any particular expertise on this issue.

suiji commented 7 years ago

The premise behind i) was optimistic, and does not hold in general. Detraining probably has to employ some form of ii).

The RF package employs 'maxnodes' as an a priori limit on training. This introduces a bias, though, as the shape of the tree will vary with the order in which it is built - an implementational choice. A detraining approach would be independent of such details, although it would admittedly do a lot of work that would later be undone.

We can use the information criterion to weight the sampling. In an earlier post you were concerned that doing so might lead to overfitting. Maybe we should try such an approach first, though: we can always resort to unweighted sampling if need be.
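As a rough illustration of the weighted variant (names reused from the earlier sketches, and again only illustrative): the victim could be drawn with probability inversely related to its information value, so weak splits are the most likely to be undone, and dropping the weights falls back to unweighted sampling.

```r
## 'mergeable' and 'info' as above: candidate parents whose children
## are both leaves, and the information values of their splits.
pick_victim <- function(mergeable, info) {
  w <- 1 / (info + .Machine$double.eps)  # invert: weaker splits dominate
  mergeable[sample.int(length(mergeable), size = 1, prob = w)]
}

## Unweighted fallback, should the weighting prove to overfit:
## mergeable[sample.int(length(mergeable), size = 1)]
```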

suiji commented 7 years ago

There is an "alpha" implementation on GitHub now. The new option is dubbed maxLeaf, for clarity.

Things seem to be working. As more nodes are "desplit", training quality degrades gracefully. Entire trees can be desplit back into single-leaf root nodes relatively quickly. The only apparent blemish is that trees now have empty slots where the splits used to be. It might be relatively straightforward to compact these gaps away, though.
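If the R front end ends up exposing the option as a training argument, usage might look something like this (a sketch against the alpha, not a confirmed interface):

```r
library(Rborist)

data(iris)

## Cap each tree at 64 terminal nodes; trees that overshoot during the
## level-wise passes are desplit back down to the limit.
rb <- Rborist(iris[, -5], iris$Species, maxLeaf = 64)
```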

suiji commented 7 years ago

The gaps are gone.

Closing this Issue, but feel free to reopen if there is more to discuss.