vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License
268 stars 44 forks

Support for named input #1106

Closed BoPeng closed 5 years ago

BoPeng commented 5 years ago

http://bionics.it/posts/workflows-dataflow-not-task-deps

# Run the same task on the two splits
t1 = Task1()
t2 = Task1()
t3 = Task2(
        in1 = t1.outspec('out1'),
        in2 = t1.outspec('out2'),
        in3 = t2.outspec('out1'),
        in4 = t2.outspec('out2'))

So t3 gets inputs in1, in2, in3, in4 as t1's out1, t1's out2, t2's out1, and t2's out2, respectively.

What we have is

[t1]

[t2]

[t3]
input: from_steps=['t1', 't2']

and we cannot differentiate individual outputs from steps t1 and t2.

BoPeng commented 5 years ago

We do have

[t3]
input: {
    'name1': filelist1
    'name2': filelist2
}

but name1 and name2 cannot be used directly and have to be used through _input['name1'] etc.

Our implementation of input uses positional arguments for file list, but this can be relaxed as

[t3]
input: name1=filelist1, name2=filelist2

as long as name1 etc do not conflict with our existing options such as group_by.

Also, we could define name1 and name2 in addition to _input (but still allow _input['name1'] etc. because our group_by etc. will depend on _input).
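The named-input idea above can be sketched in plain Python (a hypothetical container, not the actual SoS sos_targets implementation): a single flat list that group_by can operate on, which also supports lookup by source name.

```python
# Hypothetical sketch of a named-input container (NOT actual SoS
# internals): iterates as a flat file list, but also supports
# lookup by source name, i.e. _input['name1'].
class NamedInput:
    def __init__(self, **named):
        self._named = named          # e.g. {'name1': [...], 'name2': [...]}

    def __iter__(self):              # flat iteration over all files
        for files in self._named.values():
            yield from files

    def __getitem__(self, name):     # _input['name1']
        return self._named[name]

    def __len__(self):
        return sum(len(files) for files in self._named.values())

inp = NamedInput(name1=['a.txt', 'b.txt'], name2=['c.txt'])
```

Filenames here are made up for illustration; the point is that one object can serve both the "single _input" role and the named-lookup role.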

The bigger problem is with the output because we do not have named output yet.

BoPeng commented 5 years ago

Suggestion:

Allow named input

Allow syntax

[t3]
input: name1 = filelist1, name2=filelist2

Allow access of sources as

_input.name1, _input.name2, ...

and abandon the existing _input['name1'] and _input['name2'] which is a lot harder to type.

extend from_steps

Remove keyword option from_steps and make it a function, something like

input: output_of('step')

or

input: name1=output_of('step2')

Allow named output

output: fastq='something', summary='result.txt'

This is easy enough to do when there are no substeps, and we can allow

_output

as all output, and

_output.fastq

as one of the output.

Now, when we have named output, we can have

step_output.fastq

as the accumulation of _output.fastq.

gaow commented 5 years ago

but name1 and name2 cannot be used directly and have to be used through _input['name1'] etc.

What is the problem with this approach? I think we already have named input in place by that definition, and we only need to worry about named output?

Also we already have,

[t3]
name1=filelist1
name2=filelist2
input: name1, name2

right? Now when group_by is used, these names are not directly usable in a meaningful way in a workflow step. But there are many cases without group_by that our existing approaches can already handle for the input part. I doubt we need new syntax to use these variables in a step.

BoPeng commented 5 years ago

You are right about

[t3]
name1=filelist1
name2=filelist2
input: name1, name2

The proposed syntax

[t3]
input: name1=filelist1, name2=filelist2

will allow _input['name1'], but I am not sure if we should expose name1 directly in this case. snakemake seems to be using _input.name1, which is more succinct.

In any case, we will need to have a single _input to make group_by work, but we might have to expand group_by to allow

input: fastq='*.fastq', ref='ref.fa', group_fastq_by='1'
sh:
  {_input.fastq} {_input.ref}

so that only fastq is grouped, not ref.
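The "group only one named source" behavior can be sketched as follows (hypothetical helper, not actual SoS code): the named source being grouped is split into chunks, while every other named source is attached unchanged to each group.

```python
# Hypothetical sketch (not actual SoS internals): split one named
# source into groups of `size`; attach all other named sources
# (here 'ref') unchanged to every group.
def group_named(sources, name, size):
    files = sources[name]
    groups = [files[i:i + size] for i in range(0, len(files), size)]
    rest = {k: v for k, v in sources.items() if k != name}
    return [{name: g, **rest} for g in groups]

subs = group_named({'fastq': ['s1.fastq', 's2.fastq'], 'ref': ['ref.fa']},
                   'fastq', 1)
# -> two substeps, each with one fastq file and the shared ref
```

Filenames are illustrative only; the sketch just shows that `ref` is replicated across groups rather than grouped itself.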

gaow commented 5 years ago

_input.name1, _input.name2, ...

Okay I'm cool with that (in fact neutral), as long as we still have _input in there which makes the syntax very clear. I do not think _input['name1'] is too bad.

extend from_steps

Conceptually it makes it hard (at least for me) to think about group_by behavior based on source -- how do we deal with it then? We also have group_by in output. It feels like group_by behavior has to change too.

gaow commented 5 years ago

Oh okay, it seems our messages are crossing but our thoughts are converging ... I think the group_fastq_by='1' you said above is more like some for_each extension. In light of that, maybe something along the lines of group_by = ('fastq', 1)?

BoPeng commented 5 years ago

Yes, we have kept the group_by option to a single parameter, and it seems ok to allow group_by=('fastq', 1). I wrote group_fastq_by because it reads and writes better, and we are the creators and can make it happen.

Things can get complicated. For example, in our code example you might want to group the first named source by 2 and the second named source by 1, so something like

input: out1=output_from('step1'), out2=output_from('step2'),
   group_out1_by=2, group_out2_by=1

which is actually not too bad...

gaow commented 5 years ago

sure, I agree it reads well by itself. But do we want to be somewhat consistent with other options? E.g., for_each is an option that takes a string, a list, or a dictionary {(n,p): [_n < _p for _n in n for _p in p]}. The last one is certainly flexible but does not read better. But they are all for_each. Maybe it is easier for users to learn if we take this opportunity to somewhat unify the flavors of for_each and group_by rather than making them more different. We can also remove some of the pairing-based-on-source syntax.

BoPeng commented 5 years ago

Yes, we need to think these through. These options were simple at first and are getting more and more complicated, but hopefully we can keep them easy to use.

But

for_each_i=range(5)

does not read quite well because we have the simplified version of

for_each='i'

for _i.

BoPeng commented 5 years ago

I feel that the use of for_each='i', which generates _i, is a bit magical and should be deprecated. In all other flavors of loops we have for_each over a loop, so perhaps

samples = ('a', 'b')
for_each_sample = samples

is not too bad because at least we introduced a way to explicitly set variable name.

gaow commented 5 years ago

so maybe we keep to having only for_each and group_by, but make these two very similar in flavor for various extensions. Also think of some really simple but common user cases (eg like the one from that blog) and make sure our mechanism does not make simple cases more complicated?

I felt that the use of for_each='i', which generates _i is a bit magically and should be deprecated.

but I think conceptually we have one for_each which takes a string, a list, or a dictionary. This is intuitive to remember. for_each_sample constructs the LHS of a statement, which indeed simplifies the RHS, but is not that intuitive on the LHS, because I expect programmers usually would like to focus on the RHS, however ugly it is ...

gaow commented 5 years ago

Also when typed in full, the existing:

i = range(5)
for_each = 'i'

and the proposed:

for_each_i=range(5)

the proposal makes the parameter i less explicit -- are you cool with that? I think it is better to be more explicit by at least placing it on the RHS.

BoPeng commented 5 years ago

I know both arguments. I have written

input: for_each={'i': range(5)}

for so many times for testing that I had hoped we could make it easier. We can introduce the following

input: for_each_i=range(5)

as syntax sugar for these kinds of things, but I know we Pythonists should not crave it.

BoPeng commented 5 years ago

i = range(5)
for_each = 'i'

is not good because

  1. It does not show what the loop variable is (_i is our convention).
  2. The use of 'i' to refer to a variable named i is not quite natural because, as I have said, programmers are familiar with for_each over a loop, not over a name.

This is why I have preferred to use

for_each={'i': range(5)}

to explicitly set variable name and things to iterate over, but it is just awkward to type it.
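The for_each={'i': range(5)} semantics discussed here can be sketched in plain Python (hypothetical, not the actual SoS implementation): expand the dictionary of loops into the cartesian product of its values, with each substep seeing underscore-prefixed variables per the SoS convention.

```python
import itertools

# Hypothetical sketch (not actual SoS internals) of what
# for_each={'i': range(5)} expands to: the cartesian product of
# all loop values, each substep seeing _-prefixed variables (_i).
def expand_for_each(loops):
    keys = list(loops)
    return [dict(zip(['_' + k for k in keys], combo))
            for combo in itertools.product(*loops.values())]

substeps = expand_for_each({'i': range(5)})
# -> 5 substeps: {'_i': 0}, {'_i': 1}, ..., {'_i': 4}
```

With two loops, e.g. expand_for_each({'n': [1, 2], 'p': [3]}), the product yields one substep per (_n, _p) combination, which matches the nested-loop flavor of for_each.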

gaow commented 5 years ago

is not good because

I should have written:

parameter: i = range(5)
for_each = 'i'

with the potential to change i from command input. Of course internally we can still have for_each = {'i': i}.

as a syntax sugar for these kinds of thing but I know we pythonist should not crave for it.

Since SoS positions itself as Python-based, and thus as having a low learning curve, maybe we want to be consistent about that?

BoPeng commented 5 years ago

Perhaps something like:

_i = for_each(i)
sample = for_each(samples)

and we tell users that for_each is some sort of generator and _i and sample would take one value each time?

The key here is that using a keyword parameter syntax users know that _i and sample are names.

BoPeng commented 5 years ago

We can also have something to the effect of

fastq=group_by(step_input.fastq, 2)

BoPeng commented 5 years ago

So in the end we have a bunch of input-only functions (I would not make them available outside of input) that create names that can be used in the content of the step.

gaow commented 5 years ago

So in the end we have a bunch of input-only functions

But to me you are proposing for_each() to give a new data type, a generator, that we know conceptually will be used for input and naturally extends to output, implicitly. For example:

parameter: samples = list()
input: sample = for_each(samples)
bash: 
  echo {sample}

will run the bash script len(samples) times since sample is a generator? That is way too subtle to grasp.

BoPeng commented 5 years ago

Updated your example.

So as I said, the keyword arguments would generate names that can be used in the content, with the meaning of the names determined by the RHS.

For example

input: fastq=output_from('step_2')

would allow the use of _input.fastq (or even fastq, if we decide so) in the script.

input: _i = for_each(range(5))

would allow the use of _i in the script.

input: fastq=group_by(output_from('step2'), 1)

would allow the use of fastq as the outputs of step2, one group at a time.

Right now I do not see anything particularly difficult to digest, and we can keep our group_by options for a while.

gaow commented 5 years ago

so output_from and group_by will return file targets, but for_each will return parameters? The LHS does not reflect these differences. It still feels subtle to me. What would this look like in the new syntax, for example?

parameter: filename = str # or path()
parameter: a = list
parameter: b = list
input: filename, for_each = ['a', 'b'], group_by = 1

With the current syntax it is at least clear that group_by happens after the for_each parameters, and that they all accompany the input file filename.

BoPeng commented 5 years ago

Something like

parameter: filename = str # or path()
parameter: a = list
parameter: b = list
input: 
        _a=for_each(a),
        _b=for_each(b),
        _input=group_by(filename, 1)

if we use pure new syntax? Or if we consider default input,

parameter: filename = str # or path()
parameter: a = list
parameter: b = list
input: 
        _a=for_each(a),
        _b=for_each(b),
        _input=group_by(1, filenames)

I understand it is difficult to achieve nested for_each in this way.

gaow commented 5 years ago

well, I guess this is what I meant earlier by trying not to make simple cases more complicated. The new syntax seems to complicate things, and it is not very clear that group_by will loop over each a and b for the input file filename.

BoPeng commented 5 years ago

But at least it is more explicit, right? There is less guessing about _a, _b, and _input, and at least in the case of

input: for_each={'i': range(5)}

the new syntax

input: i=for_each(range(5))

is more natural because it reads like i takes values from an array one at a time.

And the explicitness helps in more complex cases such as

input: 
    fastq=group_by(output_from('step1'), 2),
    bam=group_by(output_from('step2'), 1),
    ref='reference'
sh: expand=True
  do something with {fastq} and {bam} with {ref} 

because each piece of input is defined by itself.

gaow commented 5 years ago

And the explicitly helps in more complex cases such as

Certainly ... but in this example the LHSs of = are all input files. Mixing in i = for_each() is still awkward. Should one expect to access i via _input.i?

Also does it help to also outline the plan for named output and see if the theme match?

BoPeng commented 5 years ago

In terms of output, we can do something like:

input: fastq=group_by('*.fastq', 1)
output: bam=fastq.with_suffix('.bam')

Here fastq is similar to _input, and we allow the assignment of _output to bam.

I think we can keep the use of _input and _output for "default" input and output but use these syntax for named input and output, although it is currently unclear how to mix them together. For example,

output_from('step').bam

would be named output, and

output_from(step)

would be _output, but what is _output with the presence of bam?

If we keep our current _input (and _output) + sources implementation,

output_from('step')

would be the complete _output (actually step_output), and

output_from('step').bam

would be step_output.bam, and step_output still represents multiple outputs if there are multiple named outputs. The same goes for _input.

BoPeng commented 5 years ago

_input.i is actually not a bad idea because it allows us to pass all information belonging to an input group together. In this case,

sample_names=['sample1', 'sample2']
input: 'ref.fa', fastq=group_by(['f1.fastq', 'f2.fastq'], 1), a = for_each(sample_names)

we can have an _input with components

step_input = 'ref.fa', 'f1.fastq', 'f2.fastq'
step_input.fastq = 'f1.fastq', 'f2.fastq'
_input = 'ref.fa', 'f1.fastq'
_input.fastq = 'f1.fastq'   # the same as _input['fastq']
_input[''] = 'ref.fa'  # if really needed
_input.a = 'sample1'
It also avoids the problem with

a = range(5)
input: a=for_each(a)

because the loop a would be _input.a, and using some magic we can write

a = range(5)
input: for_each(a), group_by('*.fastq')

where for_each(a) would assign to _input.a and group_by('*.fastq') would assign to _input.

BoPeng commented 5 years ago

In general, what we did for from_steps and _input['source'] almost got us named input and output, because from_steps='a' refers to the output of a step, not just a step. There will be a lot of work for the new syntax though.

gaow commented 5 years ago

would be _output, but what is _output with the presence of bam?

Maybe it is less confusing to only allow the [0] or ['bam'] syntax, not the dot syntax. And _input and _output will be flat lists of values, as if they were values from an ordered dict?

for_each(a)

I'm cool with this syntax, and with using for_each(a,b,c) if possible. But how about our old syntax: for_each = {(n,p): [(_n, _p) if _n < _p ...]}?

I guess I'm still not convinced or motivated that we should change the behavior of for_each, although we should try to figure out what to do with group_by in light of named input and output.

BoPeng commented 5 years ago

The for_each stuff handles variables, which can be completely separated from the named input/output issues, so the only thing we need to worry about is group_by.

Whereas

input: unnamed, name1=whatever, name2=whatever2, group_by=('name1', 2)

would work,

input: unnamed, name1=group_by(whatever, 2), name2=whatever2

seems to make more sense, although implementation-wise this syntax makes it difficult to determine step_input since this syntax technically specifies _input.name1 directly.

gaow commented 5 years ago

The for_each stuff handles variables, which can be completely separated from the named input/output issues, so the only thing we need to worry about is group_by.

agreed. Also agreed that the 2nd syntax is more intuitive. And we replace from_steps = with output_from()? How would you translate the example

[lasso_3, ridge_3]
input: from_steps = [1, 2], group_by = 'pairsource2', concurrent = True
output: f"{_input[2]:nn}.mse.csv"

to the new interface? (not simply pairing, but by groups of 2 here)

BoPeng commented 5 years ago

Not sure how to group, but would be something like

[lasso_3, ridge_3]
input: s1=group_by(output_from(1), 1), 
       s2=group_by(output_from(2), 2), concurrent = True
output: f"{_input.s1:nn}.mse.csv"

The syntax is more verbose but should be a lot clearer than the pairsource2 stuff. Remember we went a very long way to define pairsource2 to make it work for this particular case.
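The pairing implied by grouping two sources differently can be sketched as follows; the filenames are made up, and it is an assumption of this sketch that groups from the two sources are matched positionally, one substep per pair.

```python
# Hypothetical sketch (filenames invented): two sources grouped
# by different sizes, with groups paired positionally so each
# substep sees one group from each source.
s1 = ['lasso.csv', 'ridge.csv']                    # output_from(1), by 1
s2 = ['a.train', 'a.test', 'b.train', 'b.test']    # output_from(2), by 2

g1 = [s1[i:i + 1] for i in range(0, len(s1), 1)]
g2 = [s2[i:i + 2] for i in range(0, len(s2), 2)]
substeps = [{'s1': a, 's2': b} for a, b in zip(g1, g2)]
```

Under this assumption, both sources must produce the same number of groups (here two), which is what pairsource2 previously arranged implicitly.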

gaow commented 5 years ago

good, this is what I thought too, but with s1 = group_by(output_from(1), 2). So we are on the same page here, at least for the input. Now that group_by = and from_steps = are eliminated, I don't feel too bad about leaving for_each with its own, current convention.

Any thoughts on eliminating paired_with and group_with? Now that we have group_by as a function we can add parameters ...

gaow commented 5 years ago

Also I assume:

input: group_by(1)

will still work?

BoPeng commented 5 years ago

Maybe; I have not thought about how to implement the new syntax. At least group_by=1 would be kept for a while, perhaps forever.

gaow commented 5 years ago

At least group_by=1 would be kept for a while, perhaps forever.

I guess all current mechanisms can be kept forever for backwards compatibility, but we do not document them? (thinking of the many pipelines of mine already distributed to colleagues or out there in published work).

It will be a bit easier if we use keyword arguments, group_by(variable, size = ..., with = ...), eg, the following are allowed

input: variable # set variable to _input
input: group_by(variable, size = 1) # same as _input = group_by(variable, size = 1)
input: group_by(size = 1) # same as _input = group_by(output_from(previous_step), size = 1)

because I find it very attractive if we can remove group_by =, group_with = and paired_with = altogether.

BoPeng commented 5 years ago

A member function group_by is added to sos_targets that allows:

input: sos_targets('a.txt', 'b.txt').group_by(1)

and will support

input: output_of('step_2').group_by(1)

later.
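The member-function idea can be sketched minimally as follows (hypothetical class, not the actual sos_targets implementation): group_by records the groups and returns self, so the call can sit directly in an input: statement.

```python
# Minimal hypothetical sketch of a sos_targets-like class with a
# group_by member function (NOT the actual SoS implementation):
# group_by records groups of `size` and returns self for chaining.
class SosTargets:
    def __init__(self, *files):
        self.files = list(files)
        self.groups = [self.files]       # default: one group with everything

    def group_by(self, size):
        self.groups = [self.files[i:i + size]
                       for i in range(0, len(self.files), size)]
        return self

t = SosTargets('a.txt', 'b.txt').group_by(1)
# -> t.groups is [['a.txt'], ['b.txt']]
```

Returning self is what makes the chained form sos_targets(...).group_by(1) usable as a single input expression.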

The problem is how we should implement classical unnamed group_by. The new way

input: sos_targets('a.txt', 'b.txt').group_by(1)

is a bit long and needs explicit use of sos_targets.

Keep using

input: 'a.txt', 'b.txt', group_by=1

is possible but we are deprecating group_by=1.

There is another option for applying group_by(1) to nothing (namely the default step_input),

input: 'a.txt', 'b.txt', group_by(1)

but I am not sure if this is logically clear because we are adding an "action" as one of the input elements.

gaow commented 5 years ago

Previously I was suggesting:

input: variable # set variable to _input
input: group_by(variable, size = 1) # same as _input = group_by(variable, size = 1)
input: group_by(size = 1) # same as _input = group_by(output_from(previous_step), size = 1)

I think making it a class member function .group_by is even more confusing because we do not have other such class member functions.

BoPeng commented 5 years ago

OK, I was not particularly fond of

group_by(size=1)

because we have to use a keyword argument here (I guess not a big deal?). The syntax also has the problem of

input: 'a.txt', 'b.txt', group_by(size=1)

because it is unclear whether group_by should group a.txt, b.txt or output_from(previous_step). The latter case can be confusing but is easier to implement; the former is difficult to implement because we would be evaluating parameters that refer to previous ones.

We can say

input: group_by('a.txt', 'b.txt', size=1)

is the right way to go but we have to explain what

input: 'a.txt', 'b.txt', group_by(size=1)

would do nevertheless.

BoPeng commented 5 years ago

Also, I would prefer

group(by='pair')

over

group_by(size='pair')

gaow commented 5 years ago

Well, I was suggesting not using

input: 'a.txt', 'b.txt', group_by(size=1)

at all, but stick to

input: group_by('a.txt', 'b.txt', size=1)

where positional arguments are the input. I also agree by is better than size. And we can possibly simply use group() because we require the keyword argument by anyway? Also we can think of possibly more keyword arguments to account for paired_with and group_with

gaow commented 5 years ago

or sos_group, to be consistent with sos_targets

BoPeng commented 5 years ago

But again, if we allow both

input: 'a.txt', 'b.txt'

and

input: group(by=1)

we should expect

input: 'a.txt', 'b.txt', group(by=1)

gaow commented 5 years ago

Not if we do

input: 'a.txt', 'b.txt'

and

input: sos_group(by = 1)
...
input: sos_group('a.txt', 'b.txt', by = 1)

because users would view it as an enhanced sos_target specification.

BoPeng commented 5 years ago

@gaow This is what I have till now, let me know if you are ok with the syntax.

Note that I have not tested this on any of the real examples.

gaow commented 5 years ago

The former is a little easier to type but can only be used for groups with proper names. For example, only step_input["1"] can be used to access group 1 of step_input because step_input.1 is not allowed.

I was thinking we should get rid of the dot syntax _input.grp1 altogether, for several reasons:

  1. The exceptions / caveats you have listed above
  2. The Pythonist argument that there should be only one obvious way to do it
  3. In other OOP languages, accessing what appears to be class member data directly is not a good thing (one should use member functions).
  4. Potential convention conflicts, eg what if an input group name is sources?

BoPeng commented 5 years ago

ok.

BoPeng commented 5 years ago

To conclude our named input feature, the SoS equivalence of

# Run the same task on the two splits
t1 = Task1()
t2 = Task1()
t3 = Task2(
        in1 = t1.outspec('out1'),
        in2 = t1.outspec('out2'),
        in3 = t2.outspec('out1'),
        in4 = t2.outspec('out2'))

is

[t1]
output:  out1='t1_out1', out2='t1_out2'
[t2]
output:  out1='t2_out1', out2='t2_out2'
[t3]
input: 
        in1 = output_from('t1')['out1'],
        in2 = output_from('t1')['out2'],
        in3 = output_from('t2')['out1'],
        in4 = output_from('t2')['out2']
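The lookup output_from performs can be sketched in plain Python (hypothetical bookkeeping, not actual SoS internals): each step records its outputs under names, and output_from(step)[name] returns the named subset, using the filenames from the example above.

```python
# Hypothetical sketch (not actual SoS internals) of named-output
# lookup: each step records outputs under names, and
# output_from(step)[name] returns the named subset.
step_outputs = {
    't1': {'out1': ['t1_out1'], 'out2': ['t1_out2']},
    't2': {'out1': ['t2_out1'], 'out2': ['t2_out2']},
}

def output_from(step):
    return step_outputs[step]

in1 = output_from('t1')['out1']
# -> ['t1_out1']
```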

gaow commented 5 years ago

Do we allow output_from to count backwards, eg output_from(-N), N = 1, 2, 3, ...?

BoPeng commented 5 years ago

No. I had a look and found that it is difficult to implement because a step does not know about the existence of previous steps. Also, -1 is not particularly clear in the case of

[step_10]

[step_20]

since output_from(-1) can be step_19 or step_10.