Closed avibryant closed 9 years ago
So the sampler would need to update the distribution of the node it's about to expand for sparse features to work correctly? Anyhow, I'm pretty bullish on this, especially if it can lead to a tutorial where someone can build a tree without needing a Hadoop cluster (or even Scalding).
I can imagine wanting different downsampling strategies for training and validation, where for validation you sample agnostic of the label, but for training you attempt to balance the labels. This would of course want to be abstracted.
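To make the training/validation distinction concrete, here is a minimal sketch of the two strategies in Python. The function names and row representation are hypothetical, not anything from the codebase; the point is just that validation sampling ignores the label while training sampling caps each label's count:

```python
import random
from collections import defaultdict

def uniform_sample(rows, rate, rng=random):
    # Label-agnostic downsampling, e.g. for validation:
    # every row is kept independently with probability `rate`.
    return [r for r in rows if rng.random() < rate]

def balanced_sample(rows, per_label, rng=random):
    # Label-balanced downsampling, e.g. for training:
    # keep at most `per_label` rows of each label.
    kept, counts = [], defaultdict(int)
    rows = list(rows)
    rng.shuffle(rows)  # randomize which rows survive the cap
    for row in rows:
        if counts[row["label"]] < per_label:
            counts[row["label"]] += 1
            kept.append(row)
    return kept
```

An abstraction over these would presumably just be a function from a node's rows to a sampled subset, so either strategy (or others) could be plugged in.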
Seems sensible to me. Having an option to fix the weights seems potentially useful; e.g., you could still update the leaf distributions at a later time.
@ryw90 Right, that's what `updateTargets` does (this already exists).
Sorry, I should have been clearer; I was just agreeing with your note about `updateTargets`.
Currently, Trainer's `expandSmallNodes` will just ignore any leaves that are sufficiently large (i.e., that would have been split by the distributed expand, if it were given the chance). Given that you are likely to stop the distributed splitting at some point (and so it won't get the chance), you end up in this somewhat strange dynamic where the largest (and thus perhaps most important) nodes don't get fully expanded but the smaller, possibly less important ones do.

An alternative would be to have the in-memory algorithm expand everything, but downsample at each node (individually computing the rate based on the current leaf's target) to make sure that each one can fit into memory. This would at least get a few more levels of depth for those nodes, although they wouldn't go as deep as a true, distributed full expansion would. The distributions of the leaves this would create would be underweighted relative to anything that hadn't been downsampled, but a) it's not clear that matters much and b) they could be fixed up with an `updateTargets` at the end if desired.

It's also interesting to note that you could iterate this, with progressively less downsampling needed each time, or even build the whole tree this way (especially if you only did a small number of levels each time), though I think it would be both more expensive and less effective than the current distributed approach.
Thoughts @snoble @ryw90 @mlmanapat ?