This standardizes what Trainer does with Execution, and in the process lets us `forceToDisk` the trainingData so that we don't have to recompute it on each pass.
The idea now is that Trainer keeps 4 separate Executions:

- one for the trainingData, which currently is expected to never change
- one for the Sampler, which can be updated (e.g. to compute an OutOfTime threshold)
- one for the tree, which can be updated (to load or expand)
- one `unitExecution`, which just accumulates side-effects (like computing and saving an error)
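As a rough sketch, those four Executions could live on Trainer like this (using a toy stand-in for a Scalding-style `Execution` and hypothetical payload types; none of these names are confirmed by the actual code):

```scala
// Toy stand-in for a Scalding-style Execution[T]: a deferred computation.
// A sketch only, not the real scalding API.
final case class Execution[T](run: () => T)

// Hypothetical payload types, purely for illustration.
final case class TrainingData(rows: Vector[Double])
final case class Sampler(threshold: Double) // e.g. an out-of-time threshold
final case class Trees(trees: List[String])

final case class Trainer(
    trainingData: Execution[TrainingData], // expected never to change; forceToDisk'd once
    sampler: Execution[Sampler],           // updatable, e.g. to compute an OutOfTime threshold
    trees: Execution[Trees],               // updatable, to load or expand
    unitExecution: Execution[Unit])        // accumulates side-effects (e.g. a saved error)
```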
The latter three each have a corresponding method (`flatMapSampler`, `flatMapTrees`, and `tee`, respectively) which zips together the relevant input executions (e.g., for `flatMapSampler`, the sampler and the trainingData) and then calls a function that should produce an updated Execution of the respective type (or, in the case of `tee`, of any type, since we don't care about the result).
The Trainer's `execution` method, which ultimately becomes the root of all of this, zips together `unitExecution` and `treeExecution` (the trainingData and sampler are assumed to be interesting only as dependencies of those two).
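To make the zip-then-flatMap shape concrete, here is a self-contained sketch with a toy `Execution` stand-in and hypothetical payload types; the exact inputs zipped by `flatMapTrees` and `tee` are assumptions, and the real scalding signatures differ:

```scala
// Toy stand-in for a Scalding-style Execution[T]: a deferred computation
// supporting map, flatMap, and zip. A sketch only, not the real scalding API.
final case class Execution[T](run: () => T) {
  def map[U](f: T => U): Execution[U] = Execution(() => f(run()))
  def flatMap[U](f: T => Execution[U]): Execution[U] = Execution(() => f(run()).run())
  def zip[U](that: Execution[U]): Execution[(T, U)] = Execution(() => (run(), that.run()))
}

// Hypothetical payload types, purely for illustration.
final case class TrainingData(rows: Vector[Double])
final case class Sampler(threshold: Double)
final case class Trees(trees: List[String])

final case class Trainer(
    trainingData: Execution[TrainingData], // expected never to change
    sampler: Execution[Sampler],
    trees: Execution[Trees],
    unitExecution: Execution[Unit]) {      // accumulates side-effects only

  // Zip the sampler with the trainingData, then let fn produce the updated sampler.
  def flatMapSampler(fn: ((Sampler, TrainingData)) => Execution[Sampler]): Trainer =
    copy(sampler = sampler.zip(trainingData).flatMap(fn))

  // Zip the trees with the sampler and trainingData, then let fn produce
  // updated trees (which inputs are "relevant" here is an assumption).
  def flatMapTrees(fn: ((Trees, Sampler, TrainingData)) => Execution[Trees]): Trainer =
    copy(trees = trees.zip(sampler).zip(trainingData).flatMap {
      case ((t, s), d) => fn((t, s, d))
    })

  // Run fn purely for its side-effects: its result is discarded, but the
  // work becomes a dependency of unitExecution.
  def tee[A](fn: ((Trees, TrainingData)) => Execution[A]): Trainer =
    copy(unitExecution =
      unitExecution.zip(trees.zip(trainingData).flatMap(fn)).map(_ => ()))

  // The root Execution zips unitExecution and the trees; the trainingData
  // and sampler run only as dependencies of those two.
  def execution: Execution[Trees] =
    unitExecution.zip(trees).map { case (_, t) => t }
}
```

Note that this toy `Execution` recomputes its dependencies on every `run()`, which is exactly the behavior that `forceToDisk`ing the trainingData is meant to avoid in the real thing.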
This would also make it easy to allow for transformations on the trainingData if that ever seems useful (the obvious case here is reweighting it to allow boosting).
cc @snoble @danielhfrank