stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala
Other
391 stars 50 forks source link

Support multiple tree implementations with Bonsai #79

Closed erik-stripe closed 8 years ago

erik-stripe commented 8 years ago

This is a relatively ambitious pull request. The basic idea is to allow Brushfire to work with many different tree representations, instead of requiring users to use its Tree and AnnotatedTree types. This flexibility is provided by Bonsai's TreeOps (and FullBinaryTreeOps) type classes, which provide methods to work with trees in a generic way.

One concrete benefit of this is that users can use the compressed trees provided by Bonsai instead of Brushfire's trees. This also means it should be easier to use Brushfire to train models used in other systems (although I haven't actually done this so I don't have a specific example).

Benchmarking I've done shows that there is no major performance difference here -- some of the optimizations in the PR offset the costs of a more generic model. I think there are probably places we could improve further, but for now this seems reasonable.

I should acknowledge that I still haven't fully answered @NathanHowell's concerns from #76. I would be happy to try the changes @avibryant's suggested if everyone agrees that this is the right way to go.

This PR includes and supercedes #76. I could also rebase this PR against #76 if we would prefer to merge that first.

avibryant commented 8 years ago

Just to be clear about the scope here: it doesn't make it possible to train using multiple representations (which is a good thing, I think); it just introduces a generalized traversal abstraction that we can use in other contexts (like evaluation) with other tree representations. Right?

erik-stripe commented 8 years ago

@avibryant Right, exactly.

erik-stripe commented 8 years ago

@johnynek I'm happy to make the seed an explicit parameter -- someone else will probably seed with the clock, but that's OK (these don't need to be cryptographically secure AFAIK).

erik-stripe commented 8 years ago

I think the initial goal with Reorder was to make the instances more reusable. Since that goal has failed I'll look into "inlining" more of the rest of it in the instance.

erik-stripe commented 8 years ago

So to answer @johnynek's question in a long-winded sort of way:

It's difficult to know the N type at the point we are currently creating Reorder instances. This is because of the the way we use path-dependent types to specify the node type of a tree. The f is just collateral damage. I tried to refactor things to avoid this, but didn't come up with anything that worked.

I'll just comment this for now. It's possible that we could fix the design (and if we're unhappy enough with the current one I'll take a crack) but it would likely require some changes to bonsai to happen.

erik-stripe commented 8 years ago

@johnynek What do you think? Seems reasonable?

johnynek commented 8 years ago

:+1:

tixxit commented 8 years ago

Awesome. I think the only thing I think we should do is bump the minor version at least, so that we can indicate a fairly big change (and also possibly release patches to the previous version should migration prove difficult, though I don't suspect we'll have too much trouble).

@NathanHowell Any concerns with this plan?

erik-stripe commented 8 years ago

It seems like everyone has signed-off on this, and given that we do have a plan to deal with multi-way splits in the future, I'm going to merge this. Thanks everyone!