stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

Make all splits binary #76

Closed erik-stripe closed 8 years ago

erik-stripe commented 8 years ago

This PR simplifies brushfire's conception of split nodes (and splits) so that all decisions are binary. In practice this is almost always the case.

Some benefits of this:

  1. Should drastically reduce the number of Predicate instances we need.
  2. Avoids allocating a collection instance per split node.
  3. Simplifies algorithms that had to encode/decode an implicit binary structure.
  4. A single concrete Split class is nicer to use.

Some drawbacks:

  1. Different serialization format.
  2. Source/binary-incompatible change.

What do you all think?
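To make the idea concrete, here is a minimal sketch of what a single concrete binary split could look like; all names (`Split`, `SplitNode`, `LeafNode`, `traverse`) are hypothetical and not brushfire's actual API:

```scala
// Sketch only: a tree where every split is binary.
// A predicate routes each value left or right; no per-node collections needed.
sealed trait Node[V, T]
case class LeafNode[V, T](target: T) extends Node[V, T]
case class SplitNode[V, T](split: Split[V, T]) extends Node[V, T]

// The single concrete Split class: one predicate, two children.
case class Split[V, T](
  predicate: V => Boolean, // true -> left child, false -> right child
  left: Node[V, T],
  right: Node[V, T]
)

// Walk the tree to a leaf for a given value.
def traverse[V, T](node: Node[V, T], value: V): T = node match {
  case LeafNode(t)               => t
  case SplitNode(Split(p, l, r)) => traverse(if (p(value)) l else r, value)
}
```

With a fixed binary shape, predicates only ever answer left-or-right, which is where the reduction in `Predicate` instances and per-node allocations comes from.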

tixxit commented 8 years ago

Re: serialization format - is there an easy way to include some backwards compatibility here, assuming all our splits are actually just binary splits? Otherwise this will make the migration path quite a bit harder, as we have production models using the old format.

tixxit commented 8 years ago

@NathanHowell : Mind taking a look?

tixxit commented 8 years ago

Overall I'm really happy with this change. Great stuff!

avibryant commented 8 years ago

@tixxit I suspect we can pretty easily write a script (in Scala or even Ruby or something) that translates the JSON.

tixxit commented 8 years ago

@avibryant The problem is that we need to support both formats in production for a short time. Upgrading the affected models should be relatively easy (e.g. re-running a job plus a version bump).

tixxit commented 8 years ago

@NathanHowell Will the serialization format change make your life much harder? If not, then we can just support the old format from within our model server for the change over, then delete the code.

erik-stripe commented 8 years ago

Yeah I imagined writing a script to update our models. And like @tixxit says, I think writing a custom injection internally that can parse either format (temporarily) is a way to switch over our services to the new code without breaking anything.
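A sketch of the dual-format idea, assuming we can wrap the two codecs behind one decoder; `parseNew` and `parseOld` are stand-ins for the real injections, not actual brushfire code:

```scala
import scala.util.{Try, Success, Failure}

// Sketch only: during the migration window, try the new JSON format
// first and fall back to the legacy format on failure.
def decodeEither[A](s: String,
                    parseNew: String => Try[A],
                    parseOld: String => Try[A]): Try[A] =
  parseNew(s).orElse(parseOld(s))
```

Once all production models are re-serialized in the new format, the fallback branch (and the old codec) can be deleted.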

NathanHowell commented 8 years ago

Is it possible to ease the use of binary splits without completely removing the ability to have n-way splits?

I've been working on implementing an MDLP splitter for TDigest (based on http://ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf) that will generate multiple splits per node. The goal would be to reduce tree depth (and training iterations, important for very large training sets) without affecting model quality. It is not universally applicable though. Orthogonal trees/forests may similarly benefit from multiway splits.

erik-stripe commented 8 years ago

@NathanHowell First of all, sorry for opening a PR removing a feature you need.

Just out of curiosity, is there a problem with encoding a multi-way split as a series of binary splits? The memory/performance overhead of this should be pretty much the same as it was under the previous implementation, which used a collection of triples. I don't think it should make the trees any bigger, as far as allocations or size in memory go. (And the big-O cost of traversing a series of binary nodes should be the same as traversing a collection of the same size.)

We could provide a nice constructor to make it easy to instantiate a "logical" multi-way split. I could also imagine displaying a tree of many splits on the same feature as a multi-way split logically (sort of the dual of the previously mentioned constructor). Is there anything else that would be needed?

I think we can definitely support an explicit multi-way split node if needed but I want to be sure I understand the requirement first.
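The "nice constructor" idea might look something like this sketch, which folds a logical k-way split into a right-leaning chain of binary nodes; all names are illustrative, not brushfire's API:

```scala
// Sketch only: a minimal binary tree plus a k-way constructor.
sealed trait Node[V, T]
case class Leaf[V, T](target: T) extends Node[V, T]
case class Binary[V, T](pred: V => Boolean, left: Node[V, T], right: Node[V, T]) extends Node[V, T]

// Encode a logical multi-way split as a chain of binary splits:
// the first matching predicate wins; `default` catches the rest.
def multiWay[V, T](cases: List[(V => Boolean, T)], default: T): Node[V, T] =
  cases.foldRight(Leaf[V, T](default): Node[V, T]) {
    case ((p, t), rest) => Binary(p, Leaf(t), rest)
  }

// Evaluate by walking the chain, just like any other binary tree.
def eval[V, T](n: Node[V, T], v: V): T = n match {
  case Leaf(t)         => t
  case Binary(p, l, r) => eval(if (p(v)) l else r, v)
}
```

The chain has depth O(k) for a k-way split, which matches the big-O cost of scanning the old collection-of-triples representation.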

NathanHowell commented 8 years ago

@striation it could certainly be a tree of splits instead of a single split, that is fine. I was just imagining that there would be implications for tree expansion to fill in targets (and possibly annotations) for each split point, even if they're not a leaf.

avibryant commented 8 years ago

This could be a follow-up PR, but my proposal is that we get rid of Split and instead have Splitter directly produce Nodes, which could then be a series of binary splits to simulate a multi-way split. We then need a way for Evaluator to evaluate a Node; my proposal is that its interface changes to take an Iterable[T], which is just the Ts living in all the leaves of a Node, regardless of how deep the tree might be.

Incidentally this makes it very clear that Evaluator should actually be called TrainingError, and could usefully be applied to a whole tree...
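A sketch of that proposal, under the assumption that scoring only needs the target distributions in the candidate node's leaves; `leaves` and `TrainingError` are hypothetical names:

```scala
// Sketch only: a candidate Node is scored by the targets in all of its
// leaves, regardless of how deep the subtree is.
sealed trait Node[V, T]
case class Leaf[V, T](target: T) extends Node[V, T]
case class Binary[V, T](pred: V => Boolean, left: Node[V, T], right: Node[V, T]) extends Node[V, T]

// Collect the Ts living in all the leaves of a Node.
def leaves[V, T](n: Node[V, T]): Iterable[T] = n match {
  case Leaf(t)         => List(t)
  case Binary(_, l, r) => leaves(l) ++ leaves(r)
}

// The renamed Evaluator: scores a node's leaf targets as a training error.
trait TrainingError[T, E] {
  def error(leafTargets: Iterable[T]): E
}
```

Because `leaves` flattens any subtree, the same `TrainingError` works for a single binary split, a simulated multi-way chain, or a whole tree.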

avibryant commented 8 years ago

We have three big pull requests in flight - this one, https://github.com/stripe/brushfire/pull/74, and now https://github.com/stripe/brushfire/pull/77 . @NathanHowell @striation any thoughts on the right order to try to land them in?