stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

Downsample large nodes for in-memory splitting? #9

Closed · avibryant closed this 9 years ago

avibryant commented 9 years ago

Currently, Trainer's expandSmallNodes will just ignore any leaves that are sufficiently large (i.e., ones that would have been split by the distributed expand, had it been given the chance). Since you are likely to stop the distributed splitting at some point (so it won't get that chance), you end up in a somewhat strange dynamic where the largest (and thus perhaps most important) nodes don't get fully expanded but the smaller, possibly less important ones do.
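
A minimal sketch of that current behavior, using illustrative names (`LeafStats`, `maxInMemoryCount`) rather than brushfire's actual types: the in-memory pass simply skips any leaf whose instance count exceeds the memory budget, so the largest leaves never get the extra depth.

```scala
object ExpandSmallNodesSketch {
  // Hypothetical stand-in for whatever per-leaf bookkeeping the trainer has.
  case class LeafStats(leafIndex: Int, instanceCount: Long)

  // Only leaves small enough to fit in memory are expanded; larger ones
  // (the ones a distributed expand would have split) are left alone.
  def leavesToExpandInMemory(
      leaves: Seq[LeafStats],
      maxInMemoryCount: Long): Seq[LeafStats] =
    leaves.filter(_.instanceCount <= maxInMemoryCount)
}
```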

An alternative would be to have the in-memory algorithm expand everything, but downsample at each node (computing the rate individually for each node, based on the current leaf's target) to make sure that each one fits into memory. This would at least gain a few more levels of depth for those nodes, although they wouldn't go as deep as a true, distributed full expansion would. The distributions of the leaves this creates would be underweighted relative to leaves that hadn't been downsampled, but a) it's not clear that matters much, and b) they could be fixed up with an updateTargets at the end if desired.
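
A rough sketch of the per-leaf rate and sampling step described above (names like `downsampleRate` and `sampleLeaf` are illustrative, not existing brushfire API):

```scala
import scala.util.Random

object LeafDownsamplingSketch {
  // Rate chosen per leaf so that its sampled instances fit in memory.
  def downsampleRate(leafCount: Long, maxInMemoryCount: Long): Double =
    math.min(1.0, maxInMemoryCount.toDouble / leafCount)

  // Bernoulli-sample the leaf's instances at that rate.
  def sampleLeaf[T](instances: Iterator[T], rate: Double, rng: Random): Iterator[T] =
    if (rate >= 1.0) instances
    else instances.filter(_ => rng.nextDouble() < rate)

  // Note: the resulting leaf distributions are underweighted by roughly `rate`
  // relative to leaves that weren't downsampled; a final pass over the full
  // data (the existing updateTargets) can restore the true counts if desired.
}
```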

It's also interesting to note that you could iterate this, with progressively less downsampling needed each time, or even build the whole tree this way (especially if you only did a small number of levels per pass), though I think it would be both more expensive and less effective than the current distributed approach.

Thoughts @snoble @ryw90 @mlmanapat ?

snoble commented 9 years ago

So the sampler would need to update the distribution of the node it's about to expand for sparse features to work correctly? Anyhow, I'm pretty bullish on this, especially if it can lead to a tutorial where someone can build a tree without having a Hadoop cluster (or even Scalding).

I can imagine wanting different downsampling strategies for training and validation: for validation you'd sample agnostic of the label, but for training you'd attempt to balance the labels. This would of course want to be abstracted; one possible shape is sketched below.
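
Purely as a sketch of that abstraction (the `DownsampleStrategy` trait and both implementations are hypothetical, not brushfire code):

```scala
import scala.util.Random

// A strategy decides, per instance, whether to keep it given its label and a
// base rate chosen to fit the leaf into memory.
trait DownsampleStrategy[L] {
  def keep(label: L, baseRate: Double, rng: Random): Boolean
}

// Validation-style sampling: ignore the label entirely.
class LabelAgnostic[L] extends DownsampleStrategy[L] {
  def keep(label: L, baseRate: Double, rng: Random): Boolean =
    rng.nextDouble() < baseRate
}

// Training-style sampling: balance labels by sampling each label at a rate
// inversely proportional to its frequency in the current leaf (capped at 1.0).
class LabelBalancing[L](labelCounts: Map[L, Long]) extends DownsampleStrategy[L] {
  private val minCount = labelCounts.values.min.toDouble

  def keep(label: L, baseRate: Double, rng: Random): Boolean = {
    val balanceRate = minCount / labelCounts(label).toDouble
    rng.nextDouble() < math.min(1.0, baseRate * balanceRate)
  }
}
```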

wgyn commented 9 years ago

Seems sensible to me. Having an option to fix the weights seems potentially useful, e.g. you could still update the leaf distributions at a later time.

avibryant commented 9 years ago

@ryw90 right, that's what updateTargets does (it already exists).

wgyn commented 9 years ago

Sorry, I should have been clearer; I was just agreeing with your note about updateTargets.