pharmbio / sciluigi

A light-weight wrapper library around Spotify's Luigi workflow library to make writing scientific workflows more fluent, flexible and modular
http://dx.doi.org/10.1186/s13321-016-0179-6
MIT License

Parameters vs Inputs #53

Open multimeric opened 6 years ago

multimeric commented 6 years ago

In this example, I notice that you use both in_ parameters (in_foo = None), and normal luigi parameters (replacement = sciluigi.Parameter()). What is the actual difference here? When do I define an input as an in_ vs making it a sciluigi.Parameter?

samuell commented 6 years ago

Hi @TMiguelT,

"Normal" parameters are for data that can be passed as simple values (strings, numerical integer or float values, booleans etc), while the in_ type of inputs, are for things that need to be saved to a file before passing on between tasks.

Did that answer your questions?

multimeric commented 6 years ago

Thanks, that helps a bit. Can you connect parameters to out functions of other tasks in sciluigi, or just in fields?

samuell commented 6 years ago

> Can you connect parameters to out functions of other tasks in sciluigi, or just in fields?

Only in_ fields.

multimeric commented 6 years ago

But what if I want to specify a non-file parameter using the output of the previous job? I can't?

samuell commented 6 years ago

> But what if I want to specify a non-file parameter using the output of the previous job? I can't?

Ah, yes, this is one thing that is not so easy with Luigi/Sciluigi, unless you can write that output to a file and then read from that file in the downstream parts of the workflow.
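The file-based workaround can be sketched in plain Python, independent of any workflow library. This is only an illustration under assumptions: the function names, the `params.json` file name and the `threshold` key are all hypothetical, and in a real workflow the two steps would be separate tasks connected through an in_ field.

```python
import json
import statistics
import tempfile
from pathlib import Path


def upstream_step(out_path):
    """Compute a value at run time and persist it for later steps."""
    # Stand-in for a real calculation whose result is not known in advance.
    computed_threshold = statistics.mean([1, 2, 3])
    out_path.write_text(json.dumps({"threshold": computed_threshold}))


def downstream_step(in_path, values):
    """Read the persisted value back and use it like a parameter."""
    threshold = json.loads(in_path.read_text())["threshold"]
    return [v for v in values if v > threshold]


def run_pipeline(values):
    """Run both steps in order, passing the value through a file."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        param_file = Path(tmp_dir) / "params.json"  # hypothetical file name
        upstream_step(param_file)
        return downstream_step(param_file, values)
```

The downstream step never receives the value as a parameter; it only receives the path to a file, which is exactly the kind of thing an in_ field can carry.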

What we did when we needed this before was to put the part of the workflow that is fed with calculated parameter values into a separate workflow, and call that workflow as a separate Python script. An example where we do this is here (the use case workflow for the sciluigi paper). So that whole MainWorkflowRunner task is just a wrapper around a Python command that executes a separate, parametrized workflow instance.
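A minimal sketch of that wrapper idea, using only the standard library, might look as follows. It is not the code from the linked use case: the inner script, its `--replacement` flag and the function name `run_inner_workflow` are hypothetical stand-ins, and a real inner script would define and run its own workflow instead of printing a line.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a separate workflow file. A real one would
# define and run its own parametrized workflow instead of printing.
INNER_WORKFLOW = """\
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--replacement", required=True)
args = parser.parse_args()
print("running inner workflow with replacement=" + args.replacement)
"""


def run_inner_workflow(replacement):
    """Launch the inner workflow in a new Python process.

    The parameter value is only known now, at execution time, so it is
    passed on the command line of a freshly started workflow run.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        script = Path(tmp_dir) / "inner_workflow.py"
        script.write_text(INNER_WORKFLOW)
        result = subprocess.run(
            [sys.executable, str(script), "--replacement", replacement],
            capture_output=True,
            text=True,
            check=True,  # raise if the inner workflow exits with an error
        )
    return result.stdout.strip()
```

Because the inner workflow starts in its own process, it goes through its own scheduling phase, and by then the calculated value is an ordinary command-line argument.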

Footnote: This whole problem is related to the fact that Luigi does scheduling and execution in two separate phases, and that parameter values need to be set during the scheduling phase. This means they can't be obtained during execution, since by then the scheduling is already done. This is one reason why we have lately gravitated toward fully dataflow-based workflows, where scheduling and execution are done simultaneously, and to this end we are developing the scipipe engine instead.
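The two-phase issue can be illustrated with a toy model. This is not Luigi's actual scheduler, only a simplified sketch with hypothetical names (`ToyTask`, `schedule`, `execute`, `demo`) that shows why a parameter fixed while the graph is built cannot see a result produced later.

```python
class ToyTask:
    """A task whose parameter is fixed when the task object is created."""

    def __init__(self, name, param, requires=()):
        self.name = name
        self.param = param  # set while the graph is built, before any run
        self.requires = list(requires)
        self.result = None

    def run(self):
        self.result = f"{self.name}({self.param})"


def schedule(final_task):
    """Phase 1: walk the dependency graph and return tasks in run order."""
    order, seen = [], set()

    def visit(task):
        if id(task) in seen:
            return
        seen.add(id(task))
        for dep in task.requires:
            visit(dep)
        order.append(task)

    visit(final_task)
    return order


def execute(order):
    """Phase 2: run the already scheduled tasks one by one."""
    for task in order:
        task.run()


def demo():
    upstream = ToyTask("upstream", param=1)
    # upstream has not run yet, so its result is still None at this point.
    downstream = ToyTask("downstream", param=upstream.result, requires=[upstream])
    execute(schedule(downstream))
    return upstream.result, downstream.result
```

In `demo()`, the upstream task does produce a result during execution, but the downstream task was already constructed with `param=None`, so it never sees that result.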