jhclark / ducttape

A workflow management system for researchers who heart Unix.
http://jhclark.github.com/ducttape

Making config files more powerful by overriding and inserting tasks #60

Open jhclark opened 12 years ago

jhclark commented 12 years ago

Consider this workflow:

task tokenize < corpus=@ > tok {}
task learn < tok=@tokenize {}

A few use cases from MT pipelines:

1) Your basic workflow contains a Western tokenizer, but you'd like to replace it with a Japanese segmenter instead. Just for one configuration. (Overriding tasks in a workflow)

2) Your basic workflow contains a Western tokenizer, but you'd like to add a Japanese segmenter after the tokenizer. Just for one configuration. (Inserting tasks into the middle of a workflow)

Here are a few changes that would make this fairly natural:

1) Allow globals to be outside of global blocks, making config files have the same syntax as .tape files. This also implies that new tasks can be defined in config files.

2) Add the "override" keyword for tasks:

Config:

override task tokenize < corpus=@ > tok {}

3) To allow inserting steps inside the workflow, we must allow overriding individual outputs. This should be made explicit with some keyword such as "insertion". (WARNING: this has the potential to be confusing when reading the original workflow since we have to specify how the inputs and outputs get remapped)

An insertion task that replaces an output might look like this:

insertion task segment < in=tok@tokenize > out=tok@tokenize {}

An insertion task that replaces an input might look like this:

insertion task segment < in=tok@tokenize > out=tok@learn {}

Having outputs that are task variables will only be allowed for insertion tasks. In both of these cases, overriding the input and overriding the output are equivalent. However, this might not be the case when multiple tasks depend on the same output of some task.
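
To see where the two forms could diverge, here is a minimal Python sketch (not ducttape code, and not how ducttape implements anything) that models the workflow as a plain dependency map. It assumes a hypothetical second consumer, `align`, that also reads `tok@tokenize`; the helper names `override_output` and `override_input` are made up for illustration.

```python
# Hypothetical model: each task maps its input names to (producer_task, output_name).
def make_workflow():
    return {
        "tokenize": {},
        "learn": {"tok": ("tokenize", "tok")},
        "align": {"tok": ("tokenize", "tok")},  # assumed second consumer
    }


def override_output(workflow, name, producer, output):
    """Model `> out=tok@tokenize`: every consumer of the output is rewired."""
    rewired = {task: dict(inputs) for task, inputs in workflow.items()}
    for inputs in rewired.values():
        for key, source in list(inputs.items()):
            if source == (producer, output):
                inputs[key] = (name, "out")
    # The inserted task itself still reads the original output.
    rewired[name] = {"in": (producer, output)}
    return rewired


def override_input(workflow, name, consumer, input_name):
    """Model `> out=tok@learn`: only the named consumer is rewired."""
    rewired = {task: dict(inputs) for task, inputs in workflow.items()}
    rewired[name] = {"in": rewired[consumer][input_name]}
    rewired[consumer][input_name] = (name, "out")
    return rewired


by_output = override_output(make_workflow(), "segment", "tokenize", "tok")
by_input = override_input(make_workflow(), "segment", "learn", "tok")
print(by_output["align"])  # align now reads from segment
print(by_input["align"])   # align still reads from tokenize
```

With only `learn` downstream, both calls produce the same graph. With `align` added, overriding the output changes what `align` sees, while overriding the input of `learn` leaves `align` untouched.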

It should be explicitly noted in the documentation that having many insertion tasks is considered bad style and is not easily maintainable due to the difficulty of figuring out how things got overridden. We should supply visualization and analysis tools to ease this. However, I think this has the potential to be a powerful construct.

jhclark commented 12 years ago

Another example use case: In the cdec example workflow, I have the workflow downloading some example data to build a system on. I'd like to have examples for several language pairs, so obviously each example will need a custom download task.