Closed BoPeng closed 5 years ago
I see, if we do not keep track of the exact previous step it can be difficult to implement.
That makes filtering the input a bit harder. For example, in #1108 I proposed doing it via:
input: sos_groups([x for x in output_from(-1) if os.stat(x).st_size > 0], by = 2)
How should we do it now? I was also proposing
input: sos_groups([x for x in from_output() if os.stat(x).st_size > 0], by = 2)
Notice that `from_output()` without any argument means the output from the previous step ...
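For illustration only, here is a minimal plain-Python sketch of what the filtering expression above is meant to do: drop empty files, then form groups of two. The helper name `nonempty_groups` is hypothetical and not part of SoS; it only mirrors the list comprehension and the `by = 2` argument.

```python
import os


def nonempty_groups(paths, by=2):
    """Drop empty files, then split the rest into consecutive groups of size `by`."""
    # Same size check as in the proposed input statement: keep files with st_size > 0.
    kept = [p for p in paths if os.stat(p).st_size > 0]
    # Consecutive chunks; the last group is shorter if the count is not a multiple of `by`.
    return [kept[i:i + by] for i in range(0, len(kept), by)]
```

The point of the sketch is that the filter has to run before grouping, which is why the list of targets must be available as a plain iterable.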
I have not been able to attend to #1108, but can you just use the step number or name? If you want `from_output(-1)` to be our default input from the previous step, that is a smaller task and could possibly be done.
> but can you just use the step number or name?
I certainly can, but it makes it harder to insert steps in between, just like your example with `step_10` and `step_20`, which might later change to `step_19` and `step_20`.
> If you want `from_output(-1)` be our default input from previous step, it is a smaller task and could possibly be done.
I would love to see it happen. It is probably good for most process-oriented workflows, where the next step only cares about the previous one. That is, unless you can suggest a better way out for #1108 (using new or old syntax), which I see as a nice test for our design here.
Also, do you think `from_output` is a better name than `output_from` (assuming we can `perl` replace the code and documentation relatively easily)? Recall that we had an option `from_steps` before, which we have now discarded. `from_output` can be viewed as a generalization of `from_steps`.
I thought of `output_of` since it is shorter. I think `output_from` is better because it emphasizes `output`, which is the type of the returned stuff, and reads like "output from `step_10`". `from_output` does not read as smoothly.
Okay, I was maybe too much under the influence of `from_steps` ... but how do you feel about `sos_output`? :)
Actually I'm cool with `output_from`. I agree it is the best name by itself unless we consider some common "themes" in naming our features. But I don't see that as necessary.
Not sure about `sos_output` since it is too general. `sos_groups` can be problematic because it feels like a SoS type but it is actually a function that returns a `sos_targets` object with extra grouping information.
`sos_groups` effectively makes targets so it is not too much of a confusion, and it is not true that all SoS types are `sos_xyz`, e.g. `path`, `executable`, etc. I think `sos_groups` and `output_from` are good enough.
@BoPeng how much does the name `sos_groups` bother you, compared to other possibilities such as `groups()`? Maybe `groups` is both easier to type and less confusing conceptually if you are strict about the use of the `sos_` prefix.
Also I just want to verify:
input: group_by = 1
is now
input: sos_groups(by = 1)
when the default is from the previous step? We do not have that example in the documentation but I think it is a reasonable default.
Also just a reminder of `output_from(-1)`, whose usage is justified in #1108.
Another thing to double check on: do we allow for multiple positional arguments in `sos_groups`:
input: sos_groups(output_from(1), output_from(2), by = 'paired2')
so that it gives me
out1['A'] out1['B'] out2['A'] out2['B']
out1['C'] out1['D'] out2['C'] out2['D']
...
etc?
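To make the requested layout concrete, here is a plain-Python sketch that reproduces only the grouping pattern shown above (two targets from each source per group). The function name `paired_chunks` and its `size` parameter are hypothetical; this is not the SoS implementation of `by = 'paired2'`, whose exact semantics are not confirmed here.

```python
def paired_chunks(*sources, size=2):
    """Take `size` items at a time from every source and concatenate them per group."""
    lengths = {len(s) for s in sources}
    # Assumption: all sources must line up one-to-one, so they need equal lengths.
    if len(lengths) != 1:
        raise ValueError("all sources must have the same number of targets")
    n = lengths.pop()
    if n % size != 0:
        raise ValueError("number of targets must be a multiple of the chunk size")
    groups = []
    for start in range(0, n, size):
        group = []
        for source in sources:
            # Items start..start+size of each source go into the same group.
            group.extend(source[start:start + size])
        groups.append(group)
    return groups
```

With two sources of four targets each, the first group holds the first two targets of each source and the second group holds the remaining two of each, which matches the rows written out in the question.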
I just had a meeting with Chris Wakefield and he had a very good point in that data flow is not the same as named inputs/outputs. Actually,
input: ref=output_from('step_a')['ref']
is not that different from
input: from_steps('step_a')
because we still need to refer to a step, which was something the reviewer disliked, and which is not a data-flow-driven workflow.
What data flow actually means can be something we already have
[A]
output: 'something'
[B]
input: 'something'
That is to say, we only need to specify data, and SoS connects steps. SoS has done this for a long time under the feature of `auto-provides`. Basically, for steps with pre-determinable output (not calculated from `_input` etc), it translates
[A]
output: 'whatever'
to
[A: autoprovides='whatever']
output: 'whatever'
so that the step can be used as an auxiliary step in case `whatever` is needed.
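As a rough illustration of "we only specify data, and SoS connects steps", the following plain-Python sketch matches each step's declared inputs against the statically known outputs of other steps. The function `connect_steps` and the dictionary layout are assumptions made for this example; they are not how SoS stores steps internally.

```python
def connect_steps(steps):
    """Map each step to the set of steps that provide its inputs.

    `steps` is a dict: step name -> {"input": [...], "output": [...]}.
    """
    # Index every pre-determinable output by the step that produces it.
    providers = {}
    for name, spec in steps.items():
        for target in spec.get("output", []):
            providers[target] = name
    # A step depends on whichever step provides one of its inputs.
    deps = {}
    for name, spec in steps.items():
        deps[name] = {
            providers[t]
            for t in spec.get("input", [])
            if t in providers and providers[t] != name
        }
    return deps
```

For the `[A]`/`[B]` example, passing `{"A": {"output": ["something"]}, "B": {"input": ["something"]}}` makes `B` depend on `A` without either step naming the other.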
Putting the technical details aside, I agree that
[A]
output: 'something'
[B]
input: 'something'
feels more like data flow and we should present it in our response.
Now, given that our `autoprovides` stuff only supports pre-determinable output, it is possible for us to do something like
[A]
output: something=f'{_input}_derived'
[B]
input: data('something')
so that we can define `something` as a token for the data, and refer to the data, which is arguably a lot more flexible than `output_from(step)['something']`.
> is not that different from
I agree. My view is as follows:
In our view the only limitation of this mechanism is the lack of named output -- in other words, `from_steps` assumes the default, unnamed output from other tasks, which is implicit and appears task focused rather than data flow oriented.
"this mechanism" refers to `from_steps`.
> data flow is not the same as named inputs/outputs.
Sure, but named inputs and outputs emphasize data flow more than using task dependencies.
I have to run now but will add more thoughts later ... maybe on the go
> That is to say, we only need to specify data, and SoS connects steps.
This is what I meant to say with these:
This is true regardless of whether or not named input / output is used, but herein we provide another example using named input and output
So yes, we have had it all along.
> so that we can define something as a token for the data, and refer to the data, which is arguably a lot more flexible than `output_from(step)['something']`.
I thought about proposing something like that in the beginning, but there is potential ambiguity when multiple steps have `something`. If we make it an SoS variable then there is no need to do anything special about it. That's why I think we should just make a good case that data flow, regardless of named input and output, is what we have. But named input and output do help in understanding, specifying and using data flow between tasks.
I think we are on the same page. It is just a matter of how to word these and give examples in the right places.
`sos_variable` is a variable, which is a possible type of `BaseTarget` (`output`). I think the key here is to define `data`, probably independent of steps.
files = ['a.txt', 'b.txt']
[10]
input: files
has a flavor of this, so does
files = ['a.txt', 'b.txt']
[10]
output: files
but `files` is statically defined and is not quite useful. In contrast
[10]
output: files='whatever'
[20]
input: data('files')
makes the input of `step_20` be the `files` of whatever step. It is more useful and can be a lot easier to use than
[20]
input: output_from(10)['files']
I understand the ambiguity part because it is easy to do
[A]
output: out='a.txt'
[B]
output: out='b.txt'
and users have the confusing choices between
[C]
input: 'a.txt'
and
[C]
input: data('out')
but it should be sufficient to yield an error for such cases, the same as when we are trying to find `a.txt` through other cases (pattern matching, exact match etc).
The implementation is also easy because we already have the `autoprovide` stuff, and we only need to add something similar to `autoprovide`, with the provided stuff changed from targets to names of targets.
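A minimal sketch of that idea in plain Python, assuming a registry keyed by output name rather than by target. The class `NamedOutputRegistry` and its methods are hypothetical names for this example, not SoS API; the only behaviour taken from the discussion is that a name defined by more than one step yields an error.

```python
class NamedOutputRegistry:
    """Resolve a named output to the single step that provides it."""

    def __init__(self):
        # name -> list of (step, targets) pairs that declared this name
        self._providers = {}

    def register(self, step, name, targets):
        self._providers.setdefault(name, []).append((step, list(targets)))

    def resolve(self, name):
        found = self._providers.get(name, [])
        if not found:
            raise KeyError(f"no step provides named output {name!r}")
        if len(found) > 1:
            # Ambiguity is reported instead of guessing which step was meant.
            steps = ", ".join(step for step, _ in found)
            raise ValueError(f"named output {name!r} is provided by multiple steps: {steps}")
        return found[0]
```

Registering `out` from both `A` and `B`, as in the example above, makes `resolve('out')` fail, while a name declared by one step resolves to that step and its targets.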
The actual problem is that SoS has gone a little bit too far since it allows many methods to add dependencies to a level that could be really confusing.
On the other hand, I think named data matching
[10]
output: out='something'
[20]
input: data('out')
is a natural extension of the unnamed data matching
[10]
output: 'something'
[20]
input: 'something'
> Another thing to double check on: do we allow for multiple positional arguments in `sos_groups`: `input: sos_groups(output_from(1), output_from(2), by = 'paired2')`
Yes, a test case will be added later, but `output_from(1), output_from(2)` should work and is equivalent to `output_from([1, 2])`, and should be used if you want to rename the objects (e.g. `a=output_from(1), b=output_from(2)` is different from `a=output_from([1,2])`).
> `input: sos_groups(by = 1)`
This is not implemented yet.
> but it should be sufficient to yield an error for such cases
Agreed.
So the proposed mechanism is completely independent of steps. But I think being explicit about steps also has its value, because data flow the way you proposed feels like it is at the borderline of Process Oriented. I still would like to see it as Process Oriented for reasons we discussed offline, but it is a bit implicit now without specifying the step it comes from. Still, it is explicit in the sense that the identifier `files` has no wildcard, unlike the Make style, which is completely implicit and whose wildcard matching can be quite hard to digest. So it looks like a powerful addition to `output_from`, not a replacement.
> The actual problem is that SoS has gone a little bit too far since it allows many methods to add dependencies to a level that could be really confusing.
I tend to think and view this by flow styles and levels of explicitness. I think multi-style flow is one thing that distinguishes SoS from others. Surely we will have to document it well. The exception is if these mechanisms have complicated the code to the extent that we can no longer be confident about the correctness of the implementation; in that case, yes, we should be careful about new features like this.
So any suggestion on the name? I used
data('name')
but
named_output('name')
can be more explicit.
I think in our response we should first list
[A]
output: 'somefile'
[B]
input: 'somefile'
as something we already have for data flow, but point out its limitation. Then we say we add named output like
[A]
output: name='somefile'
[B]
input: named_data('name')
to overcome the limitation.
Then we say we also support the style cited in the blog
input: output_from('step')['name']
in case there are multiple `name`s from different steps.
Sounds great. But before that, shall we revisit the recent functions (and all functions) we have introduced or will introduce?
sos_groups()
data() or named_output()
output_from()
Is that all? (I'm avoiding discussion of `paired_with` etc for now.) Let's see if we can have a better theme for these names.
Yes. If we adopt the `set`/`get` stuff, there is no need for `paired_with`, and the original parameter would continue to work (if we do not intentionally deprecate it).
> if we do not intentionally deprecate it
I'd suggest we do not do it for all of them if possible, including `group_by` ... As far as I know a substantial number of SoS workflows are already out there in production in various places.
Okay,
`sos_groups` generates `sos_target`, so is the use of the `sos_` prefix justified? Otherwise `input_groups`, and we have `input_` and `output_` prefix-based function names.
`data()`: how about `output_variable()`, which seems to fall in line with `output_from`? The issue is that when we allow for multiple variables we have a grammar problem ...
Used `named_output()` for now because it describes the name most appropriately. I know there can be better choices than `named_output` and `output_from` but let us leave this for later.
Now `for_each`.
OK, with the new `set/get` stuff, it is possible to associate variables with `sos_targets`. So
sos_targets('a.txt').for_each('i', range(5))
says:
`i`, with values 0, 1, 2, 3, 4
Then we have `_input` as each group, and then `_input.get('i')` as the variable associated with `sos_targets`.
Of course `_input.get('i')` is a lot more difficult to use than the `i` generated from `for_each={'i': range(5)}` ....
Now, using the `group_by` convention, the syntax could be something like
for_each(sos_targets('a.txt'), i=range(5), j=samples)
which is only slightly better.
The problem with associating `for_each` with `sos_targets` (or `sos_groups`) is that `for_each` is supposed to be applied to the entire `_input`. When multiple parameters are present,
input: sos_groups(output_from(1)).for_each(whatever),
sos_groups(output_from(2)).for_each(whatever)
It is very difficult to figure out what these `for_each` do.
Now, if we separate `for_each` from the rest of the parameters, we largely have the choice between the old style
for_each={'i': range(5), 'sample': samples}
for_each=dict(i=range(5), sample=samples)
and new style
i=for_each(range(5)), sample=for_each(samples)
for_each(i=range(5), sample=samples)
> `for_each` is supposed to be applied to the entire `_input`.
Yes I agree! In addition to what you said, the associative law (does the dot order matter for `set_each` and `for_each`) is hard to process.
I vote for the old interface style, including keeping `for_each = 'var'`, but still have the parameter accessible via e.g. `_input.get('var')`, and `_var` is a shorthand for it inside the current step ...
> still have the parameter accessible via eg `_input.get('var')`, and `_var` is a shorthand of it inside the current step ...
Why do you need `_input.get('var')`?
> Why do you need `_input.get('var')`?
I think I did not think it through ... I actually want something like `_output.get('chunk')`. Here is an example of what I'm currently editing:
[1]
chunks = <a list of strings>
input: sos_groups(by = 1), for_each = 'chunks', concurrent = True
...
[2]
input: sos_groups(by = 1), concurrent = True
output: summary_stats = f"{_input:n}.summary_stats.txt", ld_matrix = f"{_input:n}.LD.txt"
python: ...
chunk = {_input.get('_chunks')}
Basically in step `[1]` I got an attribute `for_each` that generated my different output groups. I want that information to be carried on. Maybe I'm missing something obvious (that I can in fact pass it on in a different but easier way)?
I see. This makes sense. So we need to expand `for_each=...` to assign the variables to groups.
The concept that `BaseTarget` and `sos_targets` can carry dictionaries is very powerful, but I am limiting the dictionary to basic types (`bool`, `int`, `float`, `str`, and `list`, `tuple`, and `dict` of these types) because other types will make the targets not portable (e.g. they need `pickle` across processes and need the destination environment to understand them).
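The restriction can be sketched as a recursive check in plain Python. The function name `is_portable` is hypothetical and this is not the actual SoS validation code; requiring dictionary keys to be scalars is an extra assumption made for the example.

```python
_SCALARS = (bool, int, float, str)


def is_portable(value):
    """Return True if `value` is built only from the basic types listed above."""
    if isinstance(value, _SCALARS):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_portable(item) for item in value)
    if isinstance(value, dict):
        # Assumption: keys must be scalars; values may nest further.
        return all(
            isinstance(key, _SCALARS) and is_portable(item)
            for key, item in value.items()
        )
    # Anything else (sets, functions, custom objects, None) is rejected.
    return False
```

A dictionary such as `{'i': 1, 'chunks': ['a', 'b']}` passes, while one holding a function or a custom object does not, which is the kind of value that would otherwise need `pickle`.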
> but I am limiting the dictionary to basic types
Yes, at least from my current experience I do not see that I need anything that violates this design in order to get things to work the way I want. Hopefully my experience speaks for a reasonably sized group of potential users.
You actually want the `for_each` variables in not only `_input`, but also `_output`, which is unfortunately not possible because `step_output` is an accumulation of `_output`, so even if I set variables on `_output`, they will not be kept intact in `step_output`.
It is possible to keep the groups information (and associated dictionaries) in `step_output`, but that can cause some other issues since `output_from(step)` right now is a "plain" `sos_targets` without groups.
Yes I understand. I was not thinking very clearly. Basically there is no way to carry on attributes to next steps. It is good to realize this limitation.
So then there is no point in having `for_each` be a class method, due to the possible confusion that arises. Maybe we should keep `for_each` as is and sort out #1114
http://bionics.it/posts/workflows-dataflow-not-task-deps
So `t3` gets input `in1`, `in2`, `in3`, `in4` as step `t1`'s `out1`, `out2`, `out1`, `out2` respectively.
What we have is
and we cannot differentiate individual outputs from steps `t1` and `t2`.