Use case: a task (such as training a model) that runs for a tunable number of iterations, where each iteration's intermediate result is saved, used to initialize the next iteration, and can also be used directly by downstream tasks.
A trivial but ugly solution would be to create a separate task for each iteration and use a branch point on the downstream task to select the output of one of those tasks.
The proposed pattern, “self-grafting”, would instead feed the output of one task into a subsequent realization of the same task by way of a branch graft. For example:
$ cat selfgraft.tape
task preproc > trainingdata devdata {
echo "" > $trainingdata
echo "" > $devdata
}
task learn < in=$trainingdata@preproc init=(I: 0=/dev/null 1=$model@learn[I:0] 2=$model@learn[I:1] 3=$model@learn[I:2] 4=$model@learn[I:3]) > model {
echo "./train --data $in --init-model $init" > $model
}
# can be run with any value of I
task predict_eval < in=$devdata@preproc model=@learn > preds scores {
echo "./predict --data $in --model $model" > $preds
echo "./eval --data $in --preds $preds" > $scores
}
This is not perfect because it still requires manually specifying the inputs for each iteration. But it is more compact than having a bunch of tasks, and conceptually, it seems to me like it should work: though the learn task takes its own output as an input, that input always comes from a different, already completed realization, so the dependencies are correctly specified.
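As a stopgap for the manual enumeration, the init branch point could be generated outside ducttape and pasted into the tape file. The following is only a sketch: the helper name self_graft_init and its parameters are hypothetical and not part of ducttape, and it simply reproduces the string written by hand in the learn task above.

```python
def self_graft_init(task="learn", output="model", branch_point="I",
                    iterations=4, seed="/dev/null"):
    """Build the self-grafting input spec for a tape file (hypothetical helper).

    Branch 0 reads from `seed`; branch i (i >= 1) grafts the output of
    branch i-1 of the same task.
    """
    branches = [f"0={seed}"]
    for i in range(1, iterations + 1):
        # e.g. 2=$model@learn[I:1]
        branches.append(f"{i}=${output}@{task}[{branch_point}:{i - 1}]")
    return f"init=({branch_point}: {' '.join(branches)})"


if __name__ == "__main__":
    # Prints the same spec used in selfgraft.tape for I = 0..4
    print(self_graft_init(iterations=4))
```

This does not address the runtime error itself; it only avoids typing each branch by hand when the number of iterations changes.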
Currently, static analysis succeeds but an error occurs when running the workflow:
$ ../ducttape selfgraft.tape -j4
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 0 previous workflow versions
No plans specified in workflow -- Using default one-off realization plan: Each realization will have no more than 1 non-baseline branch
Checking for completed tasks
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./preproc/Baseline.baseline (Baseline.baseline)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./learn/Baseline.baseline (I.0)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./predict_eval/Baseline.baseline (I.0)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./learn/I.1 (I.1)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./predict_eval/I.1 (I.1)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./learn/I.2 (I.2)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./predict_eval/I.2 (I.2)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./learn/I.3 (I.3)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./predict_eval/I.3 (I.3)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./learn/I.4 (I.4)
RUN: /Users/nathan/dev/nlp-tools/ducttape-0.3/testexamples/./predict_eval/I.4 (I.4)
Are you sure you want to run these 11 tasks? [y/n] y
Exception in thread "main" java.lang.RuntimeException: Task not found: learn/Baseline.baseline/1
at ducttape.versioner.WorkflowVersionStore$.dependencies(WorkflowVersionStore.scala:136)
at ducttape.versioner.WorkflowVersionStore$$anonfun$6.apply(WorkflowVersionStore.scala:177)
at ducttape.versioner.WorkflowVersionStore$$anonfun$6.apply(WorkflowVersionStore.scala:177)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at ducttape.versioner.WorkflowVersionStore$.create(WorkflowVersionStore.scala:177)
at ducttape.versioner.TentativeWorkflowVersionInfo.commit(WorkflowVersionInfo.scala:101)
at ducttape.cli.ExecuteMode$.run(ExecuteMode.scala:120)
at Ducttape$$anonfun$main$8.apply(ducttape.scala:879)
at ducttape.cli.ErrorUtils$.ex2err(ErrorUtils.scala:59)
at Ducttape$.main(ducttape.scala:572)
at Ducttape.main(ducttape.scala)