stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

cleanup Trainer and forceToDisk training data #6

Closed avibryant closed 9 years ago

avibryant commented 9 years ago

This standardizes what Trainer does with Execution, and in the process lets us forceToDisk the trainingData so that we don't have to recompute it on each pass.
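To illustrate why forcing the training data to disk matters, here is a minimal self-contained sketch (not Scalding itself; `Pipe` and the counter are stand-ins for illustration): an un-forced lazy pipeline re-runs its upstream work on every pass over the data, while a "forced" one materializes the result once and serves the cached copy afterwards.

```scala
// Toy model of a lazy pipeline: each run() re-executes upstream work
// unless we "force to disk" (here: cache in memory) first.
object ForceToDiskSketch {
  var upstreamRuns = 0 // counts how often the expensive upstream step executes

  // A tiny stand-in for a lazy TypedPipe: a thunk over the data.
  final case class Pipe[T](run: () => Seq[T]) {
    def map[U](f: T => U): Pipe[U] = Pipe(() => run().map(f))
    // "forceToDisk": evaluate once now, then serve the cached result.
    def forceToDisk: Pipe[T] = { val cached = run(); Pipe(() => cached) }
  }

  // The "expensive" training-data computation.
  val raw: Pipe[Int] = Pipe { () => upstreamRuns += 1; 1 to 5 }

  // Simulate n training passes over a pipe.
  def passes(p: Pipe[Int], n: Int): Unit = (1 to n).foreach(_ => p.run())
}
```

Three passes over `raw` recompute the upstream step three times; forcing first pays the cost once, after which further passes are free.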

The idea now is that Trainer keeps 4 separate Executions:

- trainingData
- sampler
- trees
- a unit Execution (for side effects whose results we don't need)
The latter three each have a corresponding method (flatMapSampler, flatMapTrees, and tee, respectively) which zips together the relevant input Executions (e.g., for flatMapSampler, these are sampler and trainingData) and then calls a function that should produce an updated Execution of the respective type (or, in the case of tee, of any type, since we don't care about the result).
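The zip-then-flatMap shape can be sketched with a minimal stand-in for Scalding's Execution; the `Exec` type and the `Trainer` fields here are hypothetical names for illustration, not the real brushfire API.

```scala
// Hypothetical model of the pattern: flatMapSampler zips the sampler
// Execution with the trainingData Execution, then lets the caller build
// an updated sampler Execution from both.
object FlatMapPattern {
  final case class Exec[+T](run: () => T) {
    def zip[U](that: Exec[U]): Exec[(T, U)] = Exec(() => (run(), that.run()))
    def flatMap[U](f: T => Exec[U]): Exec[U] = Exec(() => f(run()).run())
    def map[U](f: T => U): Exec[U] = Exec(() => f(run()))
  }

  final case class Trainer(trainingData: Exec[Seq[Int]], sampler: Exec[Long]) {
    // Zip the relevant inputs, then call f to produce the updated Execution.
    def flatMapSampler(f: (Long, Seq[Int]) => Exec[Long]): Trainer =
      copy(sampler = sampler.zip(trainingData).flatMap { case (s, td) => f(s, td) })
  }
}
```

flatMapTrees and tee would follow the same shape, differing only in which Executions get zipped and what result type comes back.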

The Trainer's execution method, which ultimately becomes the root of all of this, zips together unitExecution and treeExecution (the trainingData and sampler are assumed to be interesting only as dependencies of those two).
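A sketch of that root zip, again with a hypothetical `Exec` stand-in and illustrative names: running the root executes both dependency chains (which is how trainingData and sampler get pulled in), but only the trees are kept as the result.

```scala
// Self-contained sketch of the root execution: zip the unit and tree
// Executions, discard the unit side, keep the trees.
object RootExecution {
  final case class Exec[+T](run: () => T) {
    def zip[U](that: Exec[U]): Exec[(T, U)] = Exec(() => (run(), that.run()))
    def map[U](f: T => U): Exec[U] = Exec(() => f(run()))
  }

  final case class Trainer(unitExecution: Exec[Unit],
                           treeExecution: Exec[List[String]]) {
    // Root of the whole pipeline: both chains run, only the trees survive.
    def execution: Exec[List[String]] =
      unitExecution.zip(treeExecution).map { case (_, trees) => trees }
  }
}
```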

This would also make it easy to allow transformations on the trainingData if that ever seems useful (the obvious case being reweighting it to support boosting).
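As a sketch of what such a transformation might look like, here is an AdaBoost-style reweighting pass; `Instance`, `predict`, and the 0.5/2.0 factors are all illustrative assumptions, not anything in brushfire.

```scala
// Hypothetical trainingData transformation for boosting: upweight the
// instances the current ensemble misclassifies, downweight the rest,
// then rescale so the total weight is unchanged.
object ReweightSketch {
  final case class Instance(features: Double, label: Boolean, weight: Double)

  def reweight(data: Seq[Instance], predict: Double => Boolean): Seq[Instance] = {
    val bumped = data.map { i =>
      val factor = if (predict(i.features) == i.label) 0.5 else 2.0
      i.copy(weight = i.weight * factor)
    }
    // Renormalize so weights still sum to the original total.
    val scale = data.map(_.weight).sum / bumped.map(_.weight).sum
    bumped.map(i => i.copy(weight = i.weight * scale))
  }
}
```

In the Trainer structure described above, this would slot in naturally as a method that flatMaps over the trainingData Execution.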

cc @snoble @danielhfrank