This adds some support for in-memory expansion, along with a few necessary pre-reqs.
A `Stopper` trait that defines which nodes are worth splitting, with separate conditions for distributed and in-memory expansion. This lets you set a rule to, for example, not bother splitting anything with fewer than 1000 instances when running in distributed mode (this is now the default). This is an important optimization in its own right: empirically, it makes expansion take the same amount of time regardless of depth, where previously it slowed down considerably as trees got deeper.
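To make the shape of this concrete, here is a minimal, self-contained sketch. The trait name `Stopper` comes from this change, but the method names, the type parameter, and the count-based rule are illustrative assumptions, not the library's actual API:

```scala
// Hypothetical sketch of the Stopper idea. Method names and the
// count-based rule below are assumptions for illustration only.
trait Stopper[T] {
  // Is this node worth splitting during a distributed expansion?
  def shouldSplitDistributed(target: T): Boolean
  // Is this node worth splitting during an in-memory expansion?
  def shouldSplitLocally(target: T): Boolean
}

// Example rule: in distributed mode, skip nodes with fewer than
// `distributedMin` instances; in-memory splitting only requires
// more than one instance.
case class FrequencyStopper(distributedMin: Long) extends Stopper[Long] {
  def shouldSplitDistributed(count: Long): Boolean = count >= distributedMin
  def shouldSplitLocally(count: Long): Boolean = count > 1
}

val stopper = FrequencyStopper(1000L)
assert(!stopper.shouldSplitDistributed(500L)) // too small for distributed
assert(stopper.shouldSplitLocally(500L))      // still fine in memory
```

The point of the two separate conditions is exactly the asymmetry described above: a node can be too small to justify a distributed pass while remaining a perfectly good target for in-memory expansion.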
An alternative Kryo serialization. You now have to explicitly import either `KryoInjections` or `JsonInjections` to pick a serialization approach. This was necessary because JSON serialization was falling over once trees got too deep.
A bugfix for the case where a tree has no splits: previously we would just drop that tree; now we preserve it unchanged (this is much more likely to happen given `Stopper`).
Along with this we add a `Tree.expand` method, which takes a `LeafNode` and an in-memory list of `Instance` objects and returns a new `Node`. We also add an `expandSmallNodes` method to `Trainer` which makes use of this.
The idea is that you expand large nodes as far as you want using the distributed expand, and then fully expand any small nodes that remain at that point. (Somewhat oddly, this leaves alone any large nodes that still exist then, which, all else being equal, are more attractive targets for expansion, but they are not tractable with the in-memory algorithm.)
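The two-phase strategy above can be sketched as a toy. Everything here — the `Node` hierarchy, the split rule, the threshold parameter — is a stand-in for illustration, not brushfire's actual implementation:

```scala
// Hypothetical toy model of the second phase: fully expanding small
// leaves in memory while leaving large leaves untouched.
sealed trait Node
case class LeafNode(id: Int, count: Long) extends Node
case class SplitNode(left: Node, right: Node) extends Node

// Leaves at or under `threshold` get split in memory (here: one toy
// split into halves); leaves above it are left for a distributed pass.
def expandSmallNodes(node: Node, threshold: Long): Node = node match {
  case SplitNode(l, r) =>
    SplitNode(expandSmallNodes(l, threshold), expandSmallNodes(r, threshold))
  case leaf @ LeafNode(id, count) =>
    if (count <= threshold && count > 1)
      // Note both children reuse the parent's id -- the toy runs into
      // the same leaf-id uniqueness problem discussed below.
      SplitNode(LeafNode(id, count / 2), LeafNode(id, count - count / 2))
    else
      leaf // large (or singleton) leaves stay as-is
}

val expanded = expandSmallNodes(LeafNode(0, 4L), threshold = 1000L)
```

Even this toy shows why combining separately constructed sub-trees makes unique leaf ids awkward: each in-memory expansion numbers its leaves without knowledge of the others.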
In one large-scale test I've done of this so far, it actually seems to make things worse, not better, which is disappointing. But I assume that won't be true in every case.
The one piece still missing is that `LeafNode` instances are supposed to have a unique id, but it's hard to coordinate that when constructing sub-trees separately and then combining them, as this does. Either the leaves could be renumbered after everything comes together, or we need some other scheme for identifying them.
cc @snoble @danielhfrank @mlmanapat @ryw90 @DanielleSucher