hbredin opened this issue 8 years ago
The question that follows is somewhat related to #18 (it is rather a generalization of it).
Since paths are now generated automatically, it is not obvious where output files were written (this is true even without the `AutoOutput` mixin).
Is there an easy way to access the outputs of all tasks constituting the workflow? I tried the following:
```python
import six

def getAllOutputs(workflow):
    # collect the output path of every task in the workflow, keyed by instance name
    outputs = {}
    for instance_name, task in six.iteritems(workflow._tasks):
        outputs[instance_name] = task.out_put().path
    return outputs

workflow = MyWorkflow()
outputs = getAllOutputs(workflow)
```
But it looks like, at this point, the tasks constituting the workflow (`workflow._tasks`) are not instantiated yet, and all we get are output paths based on default parameter values.
What does `workflow._tasks` contain exactly?
(Hi, sorry, have been a bit busy, will look at this now!)
> I had difficulty finding a good naming convention for all my `out_xxxx` paths, when my workflow would become complicated (e.g. with one task taking three other tasks as input: how should I name its output?)
This is hard to say without a concrete example. We often have cases with multiple outputs, so it has been central for us to give each output a unique name and thus an "identity". In cases where we had a single output, we have kept the same pattern and tried to give it a descriptive name, such as `.out_concatenated`, `.out_traindata`, or `.out_testdata`.
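For instance, with hypothetical task and output names (not from this thread), that pattern could look something like this:

```python
import sciluigi as sl

# Hypothetical task illustrating the out_<name> pattern with two named outputs.
class SplitDataset(sl.Task):
    in_data = None  # wired up in the workflow, e.g. to another task's out_concatenated

    def out_traindata(self):
        return sl.TargetInfo(self, self.in_data().path + '.train')

    def out_testdata(self):
        return sl.TargetInfo(self, self.in_data().path + '.test')

    def run(self):
        # split the file behind self.in_data() into the two outputs (omitted)
        pass
```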
> But it looks like, at this point, the tasks constituting the workflow (`workflow._tasks`) are not instantiated yet, and all we get are output paths based on default parameter values.
Yeah, I don't know this for sure in this case without testing, but I have often found problems with the fact that Luigi separates scheduling and workflow execution into two phases, so tasks are not fully instantiated until the scheduling phase is finished and execution has started.
Our biggest problem with this is that it makes it hard, for example, to initialize a new task with parameter values calculated by a previous task, since parameter values need to be provided at scheduling time, and scheduling is over once execution starts. As a side note, this is one reason why we are experimenting with a fully dataflow-based approach in scipipe, where scheduling and execution can happen interchangeably (but it is not production-ready yet).
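To make the constraint concrete, here is a minimal hypothetical sketch (none of these task names are from this thread): the downstream task's parameter has to be supplied when the tasks are wired up in `workflow()`, before anything has run, so it cannot be the value that the upstream task computes.

```python
import luigi
import sciluigi as sl

# Hypothetical tasks, only to illustrate the scheduling-time constraint.
class ComputeThreshold(sl.Task):
    def out_threshold(self):
        return sl.TargetInfo(self, 'threshold.txt')

    def run(self):
        with self.out_threshold().open('w') as f:
            f.write('0.7')  # this value only exists at execution time

class ApplyThreshold(sl.Task):
    threshold = luigi.FloatParameter()
    in_threshold = None

    def out_result(self):
        return sl.TargetInfo(self, 'result.txt')

    def run(self):
        # we can read the upstream *file* here, but its content could not
        # have been used to set the threshold parameter below
        with self.out_result().open('w') as f:
            f.write('applied threshold %.2f\n' % self.threshold)

class MyWorkflow(sl.WorkflowTask):
    def workflow(self):
        compute = self.new_task('compute', ComputeThreshold)
        # The parameter must be given here, at scheduling time; it cannot be
        # the value that ComputeThreshold will write, since nothing has run yet.
        apply_thr = self.new_task('apply', ApplyThreshold, threshold=0.5)
        apply_thr.in_threshold = compute.out_threshold
        return apply_thr
```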
> Is there an easy way to access the outputs of all tasks constituting the workflow? I tried the following
Will have to test a little before getting back on this, and the other remaining questions. Will get back to you shortly!
FYI, I ended up saving every automagically generated output path in an attribute of the parent workflow: https://github.com/pyannote/pyannote-workflows/blob/master/pyannote_workflows/utils.py#L56-L68
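(The linked file is the actual implementation; in rough, hypothetical form, the idea is to record each path on the workflow while wiring the tasks, e.g.:)

```python
import sciluigi as sl

# Hypothetical task used only for illustration.
class FooWriter(sl.Task):
    def out_put(self):
        return sl.TargetInfo(self, 'foo.txt')

    def run(self):
        with self.out_put().open('w') as f:
            f.write('foo\n')

class MyWorkflow(sl.WorkflowTask):
    def workflow(self):
        self.output_paths = {}  # instance name -> output path
        foo = self.new_task('foo', FooWriter)
        self.output_paths['foo'] = foo.out_put().path
        return foo
```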
I had difficulty finding a good naming convention for all my `out_xxxx` paths, when my workflow would become complicated (e.g. with one task taking three other tasks as input: how should I name its output?). Therefore, I have created a `sciluigi.Task` mixin called `AutoOutput` that automatically adds an `out_put` method to a task (see below). Maybe it can be useful for others...

All you have to do to use it is the following:

- add a `workdir` `luigi.Parameter` to the `WorkflowTask`
- add the `AutoOutput` mixin to the task you are adding to the workflow

It does have a few limitations, the main one being that it does not support tasks with structured inputs.
This will work:
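(The original snippet is not reproduced here; as a hypothetical illustration, the supported case is presumably a task with a single, flat `in_*` attribute, `AutoOutput` being the mixin shown at the end of this post:)

```python
import sciluigi as sl

# AutoOutput is the mixin sketched at the end of this thread.
class LowerCase(AutoOutput, sl.Task):
    # a single flat input: an output path can be derived from it
    in_text = None

    def run(self):
        with self.in_text().open() as fin, self.out_put().open('w') as fout:
            fout.write(fin.read().lower())
```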
This will not work:
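(Again hypothetically, a task whose inputs are structured, e.g. collected in a dict, would presumably not be supported, since there is no single input path to derive the output from:)

```python
import sciluigi as sl

class MergeFiles(AutoOutput, sl.Task):
    # structured input: a dict of upstream targets, so no single input path
    in_files = None  # e.g. {'train': a.out_put, 'test': b.out_put}

    def run(self):
        with self.out_put().open('w') as fout:
            for name in sorted(self.in_files):
                with self.in_files[name]().open() as fin:
                    fout.write(fin.read())
```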
Here is the code of the `AutoOutput` mixin:
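(The original listing is not reproduced here. Purely as an illustrative sketch of what such a mixin could look like, assuming sciluigi sets `instance_name` and `workflow_task` on every task created with `new_task()`, and that the parent `WorkflowTask` carries the `workdir` parameter mentioned above:)

```python
import os
import sciluigi as sl

class AutoOutput(object):
    """Hypothetical sketch of an AutoOutput mixin: adds an out_put() method
    whose path is derived from the workflow's workdir, the task's instance
    name, its parameter values and (if present) the path of its single,
    flat in_* attribute."""

    def out_put(self):
        # gather in_* attributes; only a single flat input (a callable
        # returning a TargetInfo) is supported, hence the limitation on
        # structured inputs mentioned above
        inputs = [getattr(self, name) for name in dir(self)
                  if name.startswith('in_') and getattr(self, name) is not None]

        # fold parameter values into the filename so that different
        # parametrizations do not overwrite each other
        # (param_kwargs is provided by luigi.Task)
        skip = {'instance_name', 'workflow_task'}
        params = '.'.join('{}={}'.format(k, v)
                          for k, v in sorted(self.param_kwargs.items())
                          if k not in skip and not k.startswith('in_'))
        stem = self.instance_name
        if params:
            stem = '{}.{}'.format(stem, params)

        if inputs:
            # derive the filename from the (single) input's path
            filename = '{}.{}'.format(os.path.basename(inputs[0]().path), stem)
        else:
            filename = stem

        workdir = self.workflow_task.workdir
        return sl.TargetInfo(self, os.path.join(workdir, filename))
```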