vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

Support for named input #1106

Closed BoPeng closed 5 years ago

BoPeng commented 5 years ago

http://bionics.it/posts/workflows-dataflow-not-task-deps

# Run the same task on the two splits
t1 = Task1()
t2 = Task1()
t3 = Task2(
        in1 = t1.outspec('out1'),
        in2 = t1.outspec('out2'),
        in3 = t2.outspec('out1'),
        in4 = t2.outspec('out2'))

So t3 gets inputs in1, in2, in3, and in4 from t1's out1, t1's out2, t2's out1, and t2's out2, respectively.

What we have is

[t1]

[t2]

[t3]
input: from_steps=['t1', 't2']

and we cannot differentiate individual outputs from steps t1 and t2.
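
For contrast, a hypothetical SoS rendering of the same pattern, assuming the named-output and output_from(...)['name'] syntax discussed later in this thread (all step, file, and output names here are purely illustrative):

[t1]
output: out1='t1.out1', out2='t1.out2'

[t2]
output: out1='t2.out1', out2='t2.out2'

[t3]
input: in1=output_from('t1')['out1'], in2=output_from('t1')['out2'],
       in3=output_from('t2')['out1'], in4=output_from('t2')['out2']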

gaow commented 5 years ago

I see; if we do not keep track of the exact previous step, it can be difficult to implement.

It also makes filtering of the input a bit harder, e.g. in #1108 I was proposing to do it via:

input: sos_groups([x for x in output_from(-1) if os.stat(x).st_size > 0], by = 2)

How should we do it now? I was also proposing

input: sos_groups([x for x in from_output() if os.stat(x).st_size > 0], by = 2)

Notice that from_output() without any argument means the output from the previous step ...

BoPeng commented 5 years ago

I have not been able to attend to #1108, but can you just use the step number or name? If you want from_output(-1) to be our default input from the previous step, it is a smaller task and could possibly be done.

gaow commented 5 years ago

but can you just use the step number or name?

I certainly can, but it makes it harder to insert steps in between, just like your example with step_10 and step_20, which might later change to step_19 and step_20.

If you want from_output(-1) to be our default input from the previous step, it is a smaller task and could possibly be done.

I would love to see it happen -- it is perhaps good for most cases of process-oriented workflows, where the next step only cares about the previous one. Unless you can suggest a better way out for #1108 (using new or old syntax), which I see as a nice test for our design here.

gaow commented 5 years ago

Also, do you think from_output is a better name than output_from (assuming we can perl-replace the code and documentation relatively easily)? Recall that we had an option from_steps before, which we have now discarded. from_output can be viewed as a generalization of from_steps.

BoPeng commented 5 years ago

I thought of output_of since it is shorter, but I think output_from is better because it emphasizes output, which is the type of the returned object, and it reads like 'output from step_10'. from_output does not read as smoothly.

gaow commented 5 years ago

Okay I was maybe too much under the influence of from_steps... but how do you feel about sos_output? :)

gaow commented 5 years ago

Actually I'm cool with output_from. I agree it is the best name by itself, unless we consider some common "themes" in naming our features. But I don't see that as necessary.

BoPeng commented 5 years ago

Not sure about sos_output since it is too general. sos_groups can be problematic because it feels like a SoS type, but it is actually a function that returns a sos_targets object with extra grouping information.

gaow commented 5 years ago

sos_groups effectively makes targets, so it is not too much of a confusion, and it is not true that all SoS types are sos_xyz, e.g. path, executable, etc. I think sos_groups and output_from are good enough.

gaow commented 5 years ago

@BoPeng how much does the name sos_groups bother you, compared to other possibilities such as groups()? Maybe groups is both easier to type and less confusing conceptually, if you are strict about the use of the sos_ prefix.

Also I just want to verify:

input: group_by = 1

is now

input: sos_groups(by = 1)

when the default is from the previous step? We do not have that example in the documentation but I think it is a reasonable default.

Also just a reminder of output_from(-1), whose usage is justified in #1108.

gaow commented 5 years ago

Another thing to double check on: do we allow for multiple positional arguments in sos_groups:

input: sos_groups(output_from(1), output_from(2), by = 'paired2')

so that it gives me

out1['A'] out1['B'] out2['A'] out2['B']
out1['C'] out1['D'] out2['C'] out2['D']
...

etc?

BoPeng commented 5 years ago

I just had a meeting with Chris Wakefield and he made a very good point: data flow is not the same as named inputs/outputs. Actually,

input: ref=output_from('step_a')['ref']

is not that different from

input: from_steps('step_a')

because we still need to refer to a step, which is something the reviewer disliked, and it is not a data-flow-driven workflow.

What data flow actually means may be something we already have:

[A]
output: 'something'

[B]
input: 'something'

That is to say, we only need to specify data, and SoS connects the steps. SoS has done this for a long time under the auto-provides feature. Basically, for steps with pre-determinable output (not calculated from _input etc.), it translates

[A]
output: 'whatever'

to

[A: autoprovides='whatever']
output: 'whatever'

so that the step can be used as an auxiliary step in case whatever is needed.
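
As a concrete illustration of this auto-provides behavior, here is a minimal sketch (the file names and shell commands are made up for this example):

[simulate]
output: 'data.txt.gz'
sh:
    seq 100 | gzip > data.txt.gz

[report]
input: 'data.txt.gz'
output: 'report.txt'
sh:
    zcat data.txt.gz | wc -l > report.txt

If auto-provides works as described above, sos run script.sos report should schedule simulate first, because its statically declared output matches the input that report requests.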

Putting the technical details aside, I agree that

[A]
output: 'something'

[B]
input: 'something'

feels more like data flow and we should present it in our response.

Now, given that our auto-provides mechanism only supports pre-determinable output, it is possible for us to do something like

[A]
output: something=f'{_input}_derived'

[B]
input:  data('something')

so that we can define something as a token for the data and refer to the data by that name, which is arguably a lot more flexible than output_from(step)['something'].

gaow commented 5 years ago

is not that different from

I agree. My view is as follows:

In our view the only limitation of this mechanism is the lack of named output -- in other words, `from_steps` assumes the default, unnamed output from other tasks, which is implicit and appears task focused rather than data flow oriented.

"this mechanism" refers to from_step.

data flow is not the same as named inputs/outputs.

Sure, but named inputs and outputs emphasize data flow more than using task dependencies.

I have to run now but will add more thoughts later ... maybe on the go

gaow commented 5 years ago

That is to say, we only need to specify data, and SoS connects steps.

This is what I meant to say with these:

This is true regardless of whether or not named input / output is used, but herein we provide another example using named input and output

so yes, we have had it all along.

so that we can define something as a token for the data, and refer to the data, which is arguably a lot more flexible than output_from(step)['something'].

I thought about proposing something like that in the beginning, but there will potentially be ambiguity when multiple steps have something. If we make it a SoS variable then there is no need to do anything special about it. That's why I think we should just make a good case that data flow, regardless of named input and output, is what we have. But named input and output does help with understanding, specifying, and using data flow between tasks.

I think we are on the same page. It is just a matter of how to word these and give examples in the right places.

BoPeng commented 5 years ago

sos_variable is a variable, which is a possible type of BaseTarget (output). I think the key here is to define data, probably independent of steps.

files = ['a.txt', 'b.txt']

[10]
input: files

has a flavor of this, so does

files = ['a.txt', 'b.txt']

[10]
output: files

but files is statically defined and is not quite useful. In contrast

[10]
output: files='whatever'

[20]
input: data('files')

makes the input of step_20 the files of whichever step defines it. It is more useful and can be a lot easier to use than

[20]
input: output_from(10)['files']

I understand the ambiguity part because it is easy to do

[A]
output: out='a.txt'

[B]
output: out='b.txt'

and users have the confusing choices between

[C]
input: 'a.txt'

and

[C]
input: data('out')

but it should be sufficient to yield an error for such cases, the same as when we are trying to find a.txt through other means (pattern matching, exact match, etc.).

The implementation is also easy because we already have the auto-provides machinery; we only need to add something similar, with the provided items changed from targets to names of targets.
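
A hypothetical sketch of that idea, tying it to the ambiguity check mentioned above (the function and variable names here are made up for illustration):

# index steps by the names of their named outputs so that a request such as
# data('files') can locate the providing step
provided_names = {}   # output name -> step that declares it

def register_named_outputs(step_name, named_outputs):
    for name in named_outputs:
        if name in provided_names:
            # the ambiguity case: two steps declare the same output name
            raise ValueError(f'output name {name!r} is provided by both '
                             f'{provided_names[name]} and {step_name}')
        provided_names[name] = step_name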

The actual problem is that SoS has gone a little bit too far since it allows many methods to add dependencies to a level that could be really confusing.

BoPeng commented 5 years ago

On the other hand, I think named data matching

[10]
output: out='something'

[20]
input: data('out')

is a natural extension of the unnamed data matching

[10]
output: 'something'

[20]
input: 'something'

BoPeng commented 5 years ago

Another thing to double check on: do we allow for multiple positional arguments in sos_groups: input: sos_groups(output_from(1), output_from(2), by = 'paired2')

Yes, a test case will be added later, but output_from(1), output_from(2) should work and is equivalent to output_from([1, 2]). The keyword form should be used if you want to rename the objects (e.g. a=output_from(1), b=output_from(2) is different from a=output_from([1, 2])).
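
A small sketch of the difference, as I read the comment above (the step and file names are made up):

[1]
output: 'a.txt'

[2]
output: 'b.txt'

[3]
# keyword form: a.txt and b.txt are kept under the separate names a and b
input: a=output_from(1), b=output_from(2)
# whereas a single name would cover both steps' outputs:
# input: a=output_from([1, 2])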

BoPeng commented 5 years ago

input: sos_groups(by = 1)

This is not implemented yet.

gaow commented 5 years ago

but it should be sufficient to yield an error for such cases

Agreed.

So the proposed mechanism is completely independent of steps. But I think being explicit about steps also has its value -- the data-flow feel of what you proposed is at the borderline of Process Oriented. I still would like to see it as Process Oriented for reasons we discussed offline, but it is now a bit implicit without specifying the step it comes from. Still, it is explicit in the sense that the identifier files has no wildcard, unlike the Make style, which is completely implicit and whose wildcard matching can be quite hard to digest. So it looks like a powerful addition to output_from, not a replacement.

The actual problem is that SoS has gone a little bit too far since it allows many methods to add dependencies to a level that could be really confusing.

I tend to think of and view this in terms of flow styles and levels of explicitness. I think multi-style flow is one thing that distinguishes SoS from others. Surely we will have to document it well. Unless these mechanisms have complicated the code to the extent that we can no longer be confident about the correctness of the implementation -- in that case, yes, we should be careful about new features like this.

BoPeng commented 5 years ago

So any suggestion on the name? I used

data('name')

but

named_output('name')

can be more explicit.

BoPeng commented 5 years ago

I think in our response we should first list

[A]
output: 'somefile'

[B]
input: 'somefile'

as something we already have for data flow, but point out its limitation. Then we say we add named output like

[A]
output: name='somefile'

[B]
input: named_data('name')

to overcome the limitation.

Then we say we also support the style cited in the blog

input: output_from('step')['name']

in case there are multiple names from different steps.

gaow commented 5 years ago

Sounds great. But before that, shall we revisit the recent (and, really, all) functions we have introduced or will introduce?

Is that all? (I'm avoiding discussion of paired_with etc for now). Let's see if we can have a better theme for these names.

BoPeng commented 5 years ago

Yes. If we adopt the set/get stuff, there is no need for paired_with and the original parameter would continue to work (if we do not intentionally deprecate it).

gaow commented 5 years ago

if we do not intentionally deprecate it

I'd suggest we do not do it for all of them if possible, including group_by ... As far as I know a substantial number of SoS workflows are already out there in production in various places.

Okay,

  1. As discussed before, since sos_groups generates sos_targets, the use of the sos_ prefix is justified? Otherwise input_groups, and then we have input_- and output_-prefixed function names.
  2. For data(), how about output_variable(), which seems to fall in line with output_from? The issue is that when we allow multiple variables we have a grammar problem ...

BoPeng commented 5 years ago

I used named_output() for now because it describes the concept most appropriately. I know there can be better choices than named_output and output_from, but let us leave this for later.
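
For the record, a minimal sketch of the named_output() form as I understand the decision above (the file and output names are illustrative):

[A]
output: stats='summary.txt'
sh:
    echo done > summary.txt

[B]
input: named_output('stats')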

Now for_each.

BoPeng commented 5 years ago

OK, with the new set/get stuff, it is possible to associate variables with sos_targets. So

sos_targets('a.txt').for_each('i', range(5))

says:

  1. create a group if there is no group.
  2. repeat the group five times.
  3. associate each group with a variable i, with values 0, 1, 2, 3, 4

Then we have _input as each group, and then _input.get('i') as the variable associated with sos_targets.

Of course _input.get('i') is a lot more difficult to use than i generated from for_each={'i': range(5)}....
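
To make the comparison concrete, a small sketch of the two styles side by side (the .for_each() method is the hypothetical one proposed above; the file names are made up):

# current style: i is injected directly as a variable in each substep
[10]
input: 'a.txt', for_each={'i': range(5)}
output: f'a_{i}.out'

# hypothetical set/get style: the value travels with each input group
[10]
input: sos_targets('a.txt').for_each('i', range(5))
output: f"a_{_input.get('i')}.out"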

BoPeng commented 5 years ago

Now, using the group_by convention, the syntax could be something like

for_each(sos_targets('a.txt'), i=range(5), j=samples)

which is only slightly better.

BoPeng commented 5 years ago

The problem with associating for_each with sos_targets (or sos_groups) is that for_each is supposed to be applied to the entire _input. When multiple parameters are present,

input: sos_groups(output_from(1)).for_each(whatever),
    sos_groups(output_from(2)).for_each(whatever)

it is very difficult to figure out what these for_each calls do.

Now, if we separate for_each from the rest of the parameters, we largely have a choice between the old style

for_each={'i': range(5), 'sample': samples}
for_each=dict(i=range(5), sample=samples)

and new style

i=for_each(range(5)), sample=for_each(samples)
for_each(i=range(5), sample=samples)

gaow commented 5 years ago

for_each is supposed to be applied to the entire _input.

Yes, I agree! In addition to what you said, the associativity (does the dot order matter for set_each and for_each?) is hard to reason about.

I vote for the old interface style, including keeping for_each = 'var', but still have the parameter accessible via e.g. _input.get('var'), with _var as a shorthand for it inside the current step ...

BoPeng commented 5 years ago

still have the parameter accessible via e.g. _input.get('var'), with _var as a shorthand for it inside the current step ...

Why do you need _input.get('var')?

gaow commented 5 years ago

Why do you need _input.get('var')?

I think I did not think it through ... I actually want something like _output.get('chunk'). Here is an example of what I'm currently editing:

[1]
chunks = <a list of strings>
input: sos_groups(by = 1), for_each = 'chunks', concurrent = True
...

[2]
input: sos_groups(by = 1), concurrent = True
output: summary_stats = f"{_input:n}.summary_stats.txt", ld_matrix = f"{_input:n}.LD.txt"
python: ...
   chunk = {_input.get('_chunks')}

Basically, in step [1] I have a for_each attribute that generated my different output groups. I want that information to be carried on. Maybe I'm missing something obvious (i.e., that I can in fact pass it on in a different but easier way)?

BoPeng commented 5 years ago

I see, this makes sense. So we need to expand for_each=... to assign the variables to groups.

BoPeng commented 5 years ago

The concept that BaseTarget and sos_targets can carry dictionaries is very powerful, but I am limiting the dictionary to basic types (bool, int, float, str, and list, tuple, or dict of these types) because other types would make the targets not portable (e.g. they would need to be pickled across processes and would require the destination environment to understand them).
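
One way such a restriction could be enforced, sketched here with a JSON round-trip check; this is only an illustration of the rationale, not SoS's actual implementation:

import json

def validate_attributes(attrs: dict) -> dict:
    # accept only values that survive JSON serialization, i.e. bool, int,
    # float, str, and list/tuple/dict built from these basic types
    for key, value in attrs.items():
        try:
            json.dumps(value)
        except TypeError:
            raise ValueError(
                f'attribute {key!r} has non-portable type {type(value).__name__}')
    return attrs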

gaow commented 5 years ago

but I am limiting the dictionary to basic types

Yes, at least from my current experience I do not see that I need anything that violates this design in order to get things to work the way I want -- hopefully my experience speaks for a reasonable share of potential users.

BoPeng commented 5 years ago

You actually want the for_each variables not only in _input but also in _output, which is unfortunately not possible because step_output is an accumulation of _output, so even if I set variables on _output, they will not be kept intact in step_output.

It is possible to keep the group information (and associated dictionaries) in step_output, but that can cause some other issues since output_from(step) right now is a "plain" sos_targets without groups.

gaow commented 5 years ago

Yes, I understand. I was not thinking very clearly. Basically, there is no way to carry attributes on to the next steps. It is good to realize this limitation.

So then there is no point in having for_each be a class method, given the possible confusion that arises. Maybe we should keep for_each as is and sort out #1114.