stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

in-memory expansion #8

Closed avibryant closed 9 years ago

avibryant commented 9 years ago

This adds some support for in-memory expansion, along with a few necessary pre-reqs.

This also adds a Tree.expand method, which takes a LeafNode and an in-memory list of Instance objects and returns a new Node. Trainer gains an expandSmallNodes method that makes use of it.

The idea is that you expand large nodes as far as you want using the distributed expand, and then fully expand any small nodes that are left at that point. Somewhat oddly, this leaves alone any large nodes that still exist: all else being equal they are more attractive targets for expansion, but they are not tractable with the in-memory algorithm.
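A rough sketch of the two-phase idea, for illustration only. The types here (`LeafNode`, `SplitNode`, `Instance`), the squared-error split criterion, and the `maxSize` / `minSize` parameters are simplified stand-ins assumed for this example; they are not brushfire's actual types or signatures. The sketch also routes every instance through the tree locally, whereas the real job would presumably only bring the instances of small leaves into memory.

```scala
// Hypothetical, simplified stand-ins; not brushfire's real API.
sealed trait Node
case class LeafNode(count: Long, prediction: Double) extends Node
case class SplitNode(feature: Int, threshold: Double, left: Node, right: Node) extends Node
case class Instance(features: Vector[Double], target: Double)

object InMemoryExpansion {
  private def mean(xs: Seq[Instance]): Double =
    if (xs.isEmpty) 0.0 else xs.map(_.target).sum / xs.size

  // Sum of squared errors around the mean target.
  private def sse(xs: Seq[Instance]): Double = {
    val m = mean(xs)
    xs.map { i => val d = i.target - m; d * d }.sum
  }

  // Fully expand one leaf from the in-memory instances that reached it.
  def expand(leaf: LeafNode, instances: Seq[Instance], minSize: Int): Node =
    if (instances.size < 2 * minSize) leaf
    else {
      val candidates = for {
        f <- instances.head.features.indices.toList
        t <- instances.map(_.features(f)).distinct
        split = instances.partition(_.features(f) <= t)
        if split._1.size >= minSize && split._2.size >= minSize
      } yield (sse(split._1) + sse(split._2), f, t, split._1, split._2)

      if (candidates.isEmpty) leaf
      else {
        val (err, f, t, l, r) = candidates.minBy(_._1)
        if (err >= sse(instances)) leaf // no improvement, stop here
        else SplitNode(f, t,
          expand(LeafNode(l.size.toLong, mean(l)), l, minSize),
          expand(LeafNode(r.size.toLong, mean(r)), r, minSize))
      }
    }

  // Expand only leaves with at most maxSize instances; larger leaves are left alone.
  def expandSmallNodes(tree: Node, instances: Seq[Instance], maxSize: Int, minSize: Int): Node =
    tree match {
      case SplitNode(f, t, l, r) =>
        val (li, ri) = instances.partition(_.features(f) <= t)
        SplitNode(f, t,
          expandSmallNodes(l, li, maxSize, minSize),
          expandSmallNodes(r, ri, maxSize, minSize))
      case leaf: LeafNode =>
        if (instances.size <= maxSize) expand(leaf, instances, minSize) else leaf
    }
}
```

With eight instances whose target flips from 0 to 1 at feature value 4, a `maxSize` of 10 lets the single root leaf be split in memory, while a `maxSize` of 4 leaves it untouched, which mirrors the "leave large nodes alone" behaviour described above.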

In the one large-scale test I've done of this so far, it actually seems to make things worse, not better, which is kinda disappointing. But I assume that won't be true in every case.

The one piece missing from this so far is that LeafNode instances are supposed to have a unique id, but that's hard to coordinate when sub-trees are constructed separately and then combined, as they are here. So either the leaves could be renumbered after everything comes together, or we come up with some other scheme for identifying leaves.
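The renumbering option could look something like the following. This is only a sketch: `IdNode`, `IdLeaf`, and `IdSplit` are hypothetical minimal types made up for the example, and the left-to-right numbering order is an arbitrary choice, not something this PR implements.

```scala
// Hypothetical minimal tree with leaf ids; not brushfire's real types.
sealed trait IdNode
case class IdLeaf(id: Int, prediction: Double) extends IdNode
case class IdSplit(left: IdNode, right: IdNode) extends IdNode

object Renumber {
  // Assigns fresh ids to leaves from left to right.
  // Returns the renumbered subtree together with the next unused id.
  def renumber(node: IdNode, nextId: Int): (IdNode, Int) = node match {
    case IdLeaf(_, p) =>
      (IdLeaf(nextId, p), nextId + 1)
    case IdSplit(l, r) =>
      val (newLeft, afterLeft) = renumber(l, nextId)
      val (newRight, afterRight) = renumber(r, afterLeft)
      (IdSplit(newLeft, newRight), afterRight)
  }
}
```

Calling `Renumber.renumber(tree, 0)` once on the combined tree gives every leaf a distinct id regardless of which ids the separately built sub-trees used.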

cc @snoble @danielhfrank @mlmanapat @ryw90 @DanielleSucher

avibryant commented 9 years ago

Ok, this is ready to merge, I think (and with some tuning it did help in the test case I mentioned above).

snoble commented 9 years ago

lgtm

avibryant commented 9 years ago

@snoble changed the get to a flatMap
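For readers following along, here is the general shape of that kind of change, assuming the `get` in question was `Option.get`. The lookup tables and method names below are invented for illustration and do not come from the diff.

```scala
object OptionChaining {
  // Hypothetical lookup tables, standing in for whatever the real code consulted.
  val leafForInstance: Map[String, Int] = Map("a" -> 0, "b" -> 1)
  val predictionForLeaf: Map[Int, Double] = Map(0 -> 0.25)

  // Unsafe: Option.get throws NoSuchElementException when the instance is unknown.
  def predictUnsafe(id: String): Option[Double] =
    predictionForLeaf.get(leafForInstance.get(id).get)

  // Safer: a missing value at either step simply yields None.
  def predictSafe(id: String): Option[Double] =
    leafForInstance.get(id).flatMap(predictionForLeaf.get)
}
```

The `flatMap` version keeps the absence case inside the `Option` rather than turning it into a runtime exception.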