stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

WIP: TrainingStep #77

Open avibryant opened 8 years ago

avibryant commented 8 years ago

attn @tixxit @non

The intent here is to capture the basic mechanics of various training steps - updateTargets, expand, prune, etc - in a way that can be reused in multiple execution environments (local, scalding, spark, ...).

For now, this is all added directly to and used only by the Local trainer. The near-term impact is that the local trainer will stream over its input data in the same way that the distributed trainers do, rather than requiring it to all be loaded into memory. For this PR to be complete, we should move the training steps to their own module (maybe also doing https://github.com/stripe/brushfire/issues/51), and refactor the scalding trainer to use them.
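To make the shape of the abstraction concrete, here is a hypothetical sketch (not brushfire's actual API; all names are illustrative) of a training step that streams over instances, aggregates per-leaf statistics monoidally, and folds the merged result back into a tree. The point is that the same step can be driven by any environment that can fold over its input, so the local trainer can stream rather than hold everything in memory:

```scala
// Hypothetical sketch of a reusable training step. `Stats` is aggregated
// per leaf with a monoid-like (empty, combine) pair, so partial results
// can be merged locally, in scalding, or in spark alike.
trait TrainingStep[Instance, Stats, Tree] {
  def empty: Stats
  def combine(a: Stats, b: Stats): Stats

  // Map one instance to the stats it contributes, keyed by leaf index.
  def contribute(tree: Tree, instance: Instance): Map[Int, Stats]

  // Fold the merged per-leaf stats back into an updated tree.
  def update(tree: Tree, statsByLeaf: Map[Int, Stats]): Tree
}

object TrainingStep {
  // A streaming driver: one pass over the input, constant memory in the
  // number of instances (only per-leaf stats are retained).
  def run[I, S, T](step: TrainingStep[I, S, T], tree: T, instances: Iterator[I]): T = {
    val merged = instances.foldLeft(Map.empty[Int, S]) { (acc, inst) =>
      step.contribute(tree, inst).foldLeft(acc) { case (m, (leaf, s)) =>
        m.updated(leaf, step.combine(m.getOrElse(leaf, step.empty), s))
      }
    }
    step.update(tree, merged)
  }
}
```

Under this shape, updateTargets, expand, and prune differ only in what `Stats` they aggregate and how `update` rewrites the tree; the driver loop is shared.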

It's very possible that this is too much or too little abstraction - right now it seems a bit overfit to the needs of the specific training steps and platforms we support, and I suspect it will be brittle going forward. (In fact, featureImportance already doesn't work with this, though I'd argue we should move to a TreeTraversal-based strategy for it, which would work.) At the same time, I think some approach like this will be valuable going forward, and I think it's better to start going imperfectly down this path.

tixxit commented 8 years ago

I really like the idea overall. Hard to tell if it is too overfit, but I agree that we need to start somewhere!

I'm starting with some nit-picky comments, but will hopefully give some more useful ones as I understand the abstraction better.

avibryant commented 8 years ago

BTW one thing I'm kinda grumpy about is the distinction between TrainingStep and OutputStep. I wanted each step to be able to either produce new trees or some sidechannel output or both (for example, in the long run I'd really like to compute out-of-band error during the expand step). But ValidationStep really does look structurally quite different; it doesn't care about per-leaf or even per-tree, and is instead just computing a single value across the whole forest. A notional FeatureImportanceStep would be similar. So I'm not sure how best to model that possibility.
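The structural difference being described can be sketched as follows (again hypothetical, not brushfire's actual API): where a TrainingStep aggregates per leaf and rewrites trees, a ValidationStep-style output step just folds every instance into a single accumulator across the whole forest, with a `combine` so partial results can still be merged in a distributed setting. The `MseStep` example below is an illustrative stand-in for such a validation step:

```scala
// Hypothetical sketch of an output step: no per-leaf or per-tree structure,
// just one commutative fold over all instances producing a single value.
trait OutputStep[Instance, Acc, Out] {
  def empty: Acc
  def add(acc: Acc, instance: Instance): Acc
  def combine(a: Acc, b: Acc): Acc // merge partial results across workers
  def finish(acc: Acc): Out
}

// Example: mean squared error of forest predictions over labeled instances,
// where `predict` closes over the whole forest.
final case class MseStep[I](predict: I => Double, label: I => Double)
    extends OutputStep[I, (Double, Long), Double] {
  def empty = (0.0, 0L)
  def add(acc: (Double, Long), i: I) = {
    val err = predict(i) - label(i)
    (acc._1 + err * err, acc._2 + 1)
  }
  def combine(a: (Double, Long), b: (Double, Long)) = (a._1 + b._1, a._2 + b._2)
  def finish(acc: (Double, Long)) = if (acc._2 == 0) 0.0 else acc._1 / acc._2
}
```

The awkwardness is visible in the types: `update` in a TrainingStep returns trees, while `finish` here returns an arbitrary `Out`, so a step that wants to do both (e.g. expand while also emitting out-of-band error) doesn't fit either shape cleanly.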

avibryant commented 8 years ago

This is very WIP still, but I've gone forward with the brushfire-training reorg, because having TrainingStep gives me somewhere to land prune and expandInMemory outside the Tree which is still reusable, which was previously a blocker for that. This probably ends up as an overly large PR, but oh well.