We do have
[t3]
input: {
    'name1': filelist1,
    'name2': filelist2
}
but name1 and name2 cannot be used directly and have to be used through _input['name1'] etc.
Our implementation of input uses positional arguments for the file list, but this can be relaxed as
[t3]
input: name1=filelist1, name2=filelist2
as long as name1 etc. do not conflict with our existing options such as group_by.
Also, we could define name1 and name2 in addition to _input (but still allow _input['name1'] etc., because our group_by etc. will depend on _input).
The bigger problem is with the output, because we do not have named output yet.
Suggestion:
Allow the syntax
[t3]
input: name1=filelist1, name2=filelist2
Allow access of sources as _input.name1, _input.name2, ..., and abandon the existing _input['name1'] and _input['name2'], which is a lot harder to type.
from_steps
Remove the keyword option from_steps and make it a function, something like
input: output_of('step')
or
input: name1=output_of('step2')
output: fastq='something', summary='result.txt'
This is easy enough to do when there are no substeps, and we can allow _output as all output, and _output.fastq as one of the outputs.
Now, when we have named output, we can have step_output.fastq as the accumulation of _output.fastq.
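For concreteness, a hypothetical sketch of the accumulation (the file names, the group_by option, and the with_suffix conversion are only for illustration; none of this is existing behavior):
[t3]
input: fastq=['a.fastq', 'b.fastq'], group_by=1
output: bam=_input.with_suffix('.bam')
# substep 1: _output.bam == 'a.bam'; substep 2: _output.bam == 'b.bam'
# step_output.bam would then accumulate to ('a.bam', 'b.bam')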
but name1 and name2 cannot be used directly and have to be used through _input['name1'] etc.
What is the problem with this approach? I think we have named input in place by that definition already, and we only need to worry about named output?
Also we already have
[t3]
name1=filelist1
name2=filelist2
input: name1, name2
right? Now when group_by is used, these names are not directly usable in a meaningful way in a workflow step. But there are many cases without group_by that our existing approaches can already handle for the input part. I doubt we need new syntax to use these variables in a step.
You are right about
[t3]
name1=filelist1
name2=filelist2
input: name1, name2
The proposed syntax
[t3]
input: name1=filelist1, name2=filelist2
will allow _input['name1'], but I am not sure if we should expose name1 directly in this case. snakemake seems to be using _input.name1, which is more succinct.
In any case, we will need to have a single _input to make group_by work, but we might have to expand group_by to allow
input: fastq='*.fastq', ref='ref.fa', group_fastq_by='1'
sh:
    {_input.fastq} {_input.ref}
so that only fastq is grouped, not ref.
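To illustrate the intended grouping, assuming *.fastq matches f1.fastq and f2.fastq:
# substep 1: _input.fastq == 'f1.fastq', _input.ref == 'ref.fa'
# substep 2: _input.fastq == 'f2.fastq', _input.ref == 'ref.fa'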
_input.name1, _input.name2, ...
Okay I'm cool with that (in fact neutral), as long as we still have _input in there, which makes the syntax very clear. I do not think _input['name1'] is too bad.
extend from_steps
Conceptually it makes it hard (at least for me) to think about group_by behavior based on source -- how do we deal with it then? We also have group_by in output. It feels like group_by behavior has to be changed too.
Oh okay, it seems our messages are crossing but our thoughts are converging ... I think the group_fastq_by='1' you said above is more like some for_each extension. In light of that, maybe something along the lines of group_by = ('fastq', 1)?
Yes, we have kept the group_by option to a single parameter, and it seems ok to allow group_by=('fastq', 1). I wrote group_fastq_by because it reads better and writes better, and we are the creators and can make it happen.
Things can get complicated. For example, in our code example you might want to group the first named source by 2 and the second named source by 1, so something like
input: out1=output_from('step1'), out2=output_from('step2'),
    group_out1_by=2, group_out2_by=1
which is actually not too bad...
sure, I agree it reads well by itself. But do we want to be somewhat consistent with other options? eg for_each is an option that takes a string, a list or a dictionary such as {(n,p): [(_n, _p) for _n in n for _p in p if _n < _p]}. The last one is certainly flexible but does not read better. But they are all for_each. Maybe it is easier for users to learn if we can take this opportunity to somewhat unify the flavors of for_each and group_by rather than making them more different. We can also remove some of the pairing-based-on-source syntax.
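For reference, the three existing flavors side by side (variable names are only for illustration):
input: for_each='i'                # loop over existing variable i, generates _i
input: for_each=['i', 'j']         # nested loop over existing variables i and j
input: for_each={'k': range(5)}    # set the variable name k and the values in one go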
Yes, we need to think through these. These options were simple at first and are getting more and more complicated, but hopefully we can keep them easy to use.
But
for_each_i=range(5)
does not read quite as well, because we have the simplified version of for_each='i' for _i.
I felt that the use of for_each='i', which generates _i, is a bit magical and should be deprecated. In all other flavors of loops we have for_each over a loop, so perhaps
samples = ('a', 'b')
for_each_sample = samples
is not too bad, because at least we introduce a way to explicitly set the variable name.
so maybe we keep to having only for_each and group_by, but make these two very similar in flavor for various extensions. Also, think of some really simple but common user cases (eg like from that blog) and make sure our mechanism does not make simple cases more complicated?
I felt that the use of for_each='i', which generates _i, is a bit magical and should be deprecated.
but I think conceptually we have one for_each which takes a string, a list or a dictionary. This is intuitive to remember. for_each_sample constructs the LHS of a statement, which indeed simplifies the RHS, but is not that intuitive on the LHS, because I expect programmers usually would like to focus on the RHS however ugly it is ...
Also, when typed in full, between the existing
i = range(5)
for_each = 'i'
and the proposed
for_each_i=range(5)
the proposed makes the parameter i less explicit -- are you cool with that? I think it is better to be more explicit by at least placing it on the RHS.
I know both arguments. I have written
input: for_each={'i': range(5)}
so many times for testing and I had hoped we could make it easier. We can introduce
input: for_each_i=range(5)
as syntactic sugar for this kind of thing, but I know we Pythonists should not crave it.
i = range(5)
for_each = 'i'
is not good because it exposes a variable i in addition to the loop variable (_i is our convention). Using 'i' to refer to a variable name i is not quite natural because, as I have said, programmers are familiar with for_each over a loop, not over a name. This is why I have preferred to use
for_each={'i': range(5)}
to explicitly set the variable name and the things to iterate over, but it is just awkward to type.
is not good because
I should have written:
parameter: i = range(5)
for_each = 'i'
with the potential to change i from the command line. Of course, internally we can still have for_each = {'i': i}.
as syntactic sugar for this kind of thing, but I know we Pythonists should not crave it.
Since SoS positions itself as Python-based, hence the low learning curve, maybe we want to be consistent about that?
Perhaps something like:
_i = for_each(i)
sample = for_each(samples)
and we tell users that for_each is some sort of generator and _i and sample would take one value each time?
The key here is that using a keyword parameter syntax, users know that _i and sample are names.
We can also have something to the effect of
fastq=group_by(step_input.fastq, 2)
So in the end we have a bunch of input-only functions (I would not make them available outside of input) that create names that can be used in the content of the step.
So in the end we have a bunch of input-only functions
But to me you are proposing for_each() to give a new data type, a generator, that we know conceptually will be used for input and naturally extends to output, implicitly. For example:
parameter: samples = list()
input: sample = for_each(samples)
bash:
    echo {sample}
will run the bash script len(samples) times since sample is a generator? It is way too subtle to grasp.
Updated your example.
So as I said, the keyword arguments would generate names that can be used in the content, with the meaning of the names determined by the RHS.
For example,
input: fastq=output_from('step_2')
would allow the use of _input.fastq (even fastq if we decide so) in the script.
input: _i = for_each(range(5))
would allow the use of _i in the script.
input: fastq=group_by(output_from('step2'), 1)
would allow the use of fastq as one of the outputs of step2, one by one.
Right now I do not see anything particularly difficult to digest, and we can keep our group_by options for a while.
so output_from and group_by will return file targets, but for_each will return parameters? The LHS does not reflect these differences. It still feels subtle to me. What would this be like in the new syntax, for example?
parameter: filename = str # or path()
parameter: a = list
parameter: b = list
input: filename, for_each = ['a', 'b'], group_by = 1
With the current syntax it is at least clear that group_by happens after the for_each parameters and applies to the accompanying filename.
Something like
parameter: filename = str # or path()
parameter: a = list
parameter: b = list
input:
    _a=for_each(a),
    _b=for_each(b),
    _input=group_by(filename, 1)
if we use the pure new syntax? Or, if we consider default input,
parameter: filename = str # or path()
parameter: a = list
parameter: b = list
input:
    _a=for_each(a),
    _b=for_each(b),
    _input=group_by(1, filename)
I understand it is difficult to achieve nested for_each in this way.
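To spell out the nesting problem (a sketch, with variable names only for illustration):
# old syntax: a nested loop over all combinations of a and b
input: filename, for_each=['a', 'b']     # len(a) * len(b) substeps
# new syntax: nothing says whether the two loops nest or zip
input: _a=for_each(a), _b=for_each(b)    # len(a) * len(b) or min(len(a), len(b)) substeps?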
well, I guess this is what I meant earlier by trying not to make simple cases more complicated. The new syntax seems to complicate things, and it is not very clear that group_by will loop over each a and b for input file filename.
But at least it is more explicit, right? There is less guessing about _a, _b, and _input, and at least in the case of
input: for_each={'i': range(5)}
the new syntax
input: i=for_each(range(5))
is more natural, because it reads like i is taking values of an array one at a time.
And the explicitness helps in more complex cases such as
input:
    fastq=group_by(output_from('step1'), 2),
    bam=group_by(output_from('step2'), 1),
    ref='reference'
sh: expand=True
    do something with {fastq} and {bam} with {ref}
because each piece of input is defined by itself.
And the explicitness helps in more complex cases such as
Certainly ... but in this example the LHS of = are all input files. Mixing in i = for_each() is still awkward. Should one expect to access i via _input.i?
Also, does it help to also outline the plan for named output and see if the themes match?
In terms of output, we can do something like:
input: fastq=group_by('*.fastq', 1)
output: bam=fastq.with_suffix('.bam')
Here fastq is similar to _input, and we allow the assignment of _output to bam.
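For concreteness, a hypothetical expansion, assuming *.fastq matches a.fastq and b.fastq:
# substep 1: fastq == 'a.fastq', bam == 'a.bam'
# substep 2: fastq == 'b.fastq', bam == 'b.bam'
# step_output.bam would accumulate to ('a.bam', 'b.bam')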
I think we can keep the use of _input and _output for "default" input and output but use this syntax for named input and output, although it is currently unclear how to mix them together. For example,
output_from('step').bam
would be named output, and
output_from('step')
would be _output, but what is _output in the presence of bam?
If we keep our current _input (and _output) + sources implementation,
output_from('step')
would be the complete _output (actually step_output), and
output_from('step').bam
would be step_output.bam, and step_output still represents multiple outputs if there are multiple named outputs. The same goes for _input.
_input.i is actually not a bad idea, because it allows us to pass all information belonging to an input group together. In this case,
sample_names=['sample1', 'sample2']
input: 'ref.fa', fastq=group_by(['f1.fastq', 'f2.fastq'], 1), a = for_each(sample_names)
we can have an _input with components
step_input = 'ref.fa', 'f1.fastq', 'f2.fastq'
step_input.fastq = 'f1.fastq', 'f2.fastq'
_input = 'ref.fa', 'f1.fastq'
_input.fastq = 'f1.fastq' # the same as _input['fastq']
_input[''] = 'ref.fa' # if really needed
_input.a = 'sample1'
It also avoids the problem with
a = range(5)
input: a=for_each(a)
because the loop variable a would be _input.a, and using some magic we can write
a = range(5)
input: for_each(a), group_by('*.fastq')
where for_each(a) would assign to _input.a and group_by('*.fastq') would assign to _input.
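A sketch of how this might expand, assuming a cross product of the loop and the groups, with *.fastq matching f1.fastq and f2.fastq (the substep order is a guess):
# substep 1: _input == 'f1.fastq', _input.a == 0
# substep 2: _input == 'f2.fastq', _input.a == 0
# substep 3: _input == 'f1.fastq', _input.a == 1
# ... and so on, for 10 substeps in total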
Generally speaking, what we did for from_steps and _input['source'] already had named input and output almost there, because from_steps='a' refers to the output of a step, not just a step. There will be a lot of work for the new syntax, though.
would be _output, but what is _output in the presence of bam?
Maybe it is less confusing to only allow the [0] or ['bam'] syntax, not the dot syntax. And _input and _output would be flat lists of values, as if they were values from an ordered dict?
for_each(a)
I'm cool with this syntax, and with using for_each(a, b, c) if possible. But how about our old syntax for_each = {(n,p): [(_n, _p) if _n < _p ...]}?
I guess I'm still not convinced or motivated that we should change the behavior of for_each, although we should try to figure out what to do with group_by in light of named input and output.
The for_each stuff handles variables, which can be completely separated from the named input/output issues, so the only thing we need to worry about is group_by.
Whereas
input: unnamed, name1=whatever, name2=whatever2, group_by=('name1', 2)
would work,
input: unnamed, name1=group_by(whatever, 2), name2=whatever2
seems to make more sense, although implementation-wise this syntax makes it difficult to determine step_input, since it technically specifies _input.name1 directly.
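A sketch of how step_input might be resolved here, assuming group_by() keeps a reference to its ungrouped source:
# step_input   = unnamed + whatever + whatever2   (the ungrouped union)
# _input.name1 = one group of 2 from whatever     (varies per substep)
# _input.name2 = whatever2                        (constant across substeps)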
The for_each stuff handles variables, which can be completely separated from the named input/output issues, so the only thing we need to worry about is group_by.
agreed. Also agreed that the 2nd syntax is more intuitive. And we replace from_steps= with output_from()? How would you translate the example
[lasso_3, ridge_3]
input: from_steps = [1, 2], group_by = 'pairsource2', concurrent = True
output: f"{_input[2]:nn}.mse.csv"
to the new interface? (not simply pairing, but by groups of 2 here)
Not sure how to group, but it would be something like
[lasso_3, ridge_3]
input: s1=group_by(output_from(1), 1),
    s2=group_by(output_from(2), 2), concurrent = True
output: f"{_input.s1:nn}.mse.csv"
The syntax is more verbose but should be a lot clearer than the pairsource2 stuff. Remember we went a very long way to define pairsource2 to make it work for this particular case.
good, this is what I thought too, but with s1 = group_by(output_from(1), 2). So we are on the same page here, at least for the input. Now with group_by= and from_steps= eliminated, I don't feel too bad about for_each keeping its own convention (i.e., leaving its current convention).
Any thoughts on eliminating paired_with and group_with? Now that we have group_by as a function we can add parameters ...
Also, I assume
input: group_by(1)
will still work?
Maybe, I have not thought of how to implement the new syntax. At least group_by=1 would be kept for a while, perhaps forever.
At least group_by=1 would be kept for a while, perhaps forever.
I guess all current mechanisms can be kept forever for backwards compatibility but we do not document them? (thinking of lots of my pipelines already distributed to colleagues or out there for published work).
It will be a bit easier if we use keyword arguments, group_by(variable, size = ..., with = ...), eg, the following are allowed
input: variable # set variable to _input
input: group_by(variable, size = 1) # same as _input = group_by(variable, size = 1)
input: group_by(size = 1) # same as _input = group_by(output_from(previous_step), size = 1)
because I find it very attractive if we can remove group_by=, group_with= and paired_with= altogether.
A member function group_by is added to sos_targets that allows:
input: sos_targets('a.txt', 'b.txt').group_by(1)
and will support
input: output_of('step_2').group_by(1)
later.
The problem is how we should implement the classical unnamed group_by. The new way
input: sos_targets('a.txt', 'b.txt').group_by(1)
is a bit long and needs explicit use of sos_targets.
Keeping
input: 'a.txt', 'b.txt', group_by=1
is possible, but we are deprecating group_by=1.
There is another option for applying group_by(1) to nothing (namely the default step_input),
input: 'a.txt', 'b.txt', group_by(1)
but I am not sure if this is logically clear, because we are adding an "action" as one of the input elements.
Previously I was suggesting:
input: variable # set variable to _input
input: group_by(variable, size = 1) # same as _input = group_by(variable, size = 1)
input: group_by(size = 1) # same as _input = group_by(output_from(previous_step), size = 1)
I think making it a class member function .group_by is even more confusing, because we do not have other such class member functions.
OK, I was not particularly fond of
group_by(size=1)
because we have to use a keyword argument here (I guess not a big deal?). The syntax also has the problem of
input: 'a.txt', 'b.txt', group_by(size=1)
because it is unclear if group_by should group a.txt and b.txt, or output_from(previous_step). The latter case can be confusing but easier to implement; the former is difficult to implement because we would be evaluating parameters referring to previous ones.
We can say
input: group_by('a.txt', 'b.txt', size=1)
is the right way to go, but we have to explain what
input: 'a.txt', 'b.txt', group_by(size=1)
would do nevertheless.
Also, I would prefer
group(by='pair')
over
group_by(size='pair')
Well, I was suggesting not using
input: 'a.txt', 'b.txt', group_by(size=1)
at all, but sticking to
input: group_by('a.txt', 'b.txt', size=1)
where positional arguments are the input. I also agree by is better than size. And we can possibly simply use group(), because we have the required keyword argument by anyway? Also, we can think of possibly more keyword arguments to account for paired_with and group_with. Or, sos_group to be consistent with sos_targets?
But again, if we allow both
input: 'a.txt', 'b.txt'
and
input: group(by=1)
we should expect
input: 'a.txt', 'b.txt', group(by=1)
Not if we do
input: 'a.txt', 'b.txt'
and
input: sos_group(by = 1)
...
input: sos_group('a.txt', 'b.txt', by = 1)
because users would view it as an enhanced sos_targets specification.
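A sketch of the hypothetical sos_group semantics (file names only for illustration):
input: sos_group('a.txt', 'b.txt', 'c.txt', 'd.txt', by=2)
# substep 1: _input == ('a.txt', 'b.txt')
# substep 2: _input == ('c.txt', 'd.txt')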
@gaow This is what I have till now, let me know if you are ok with the syntax.
Note that I have not tested this on any of the real examples.
The former is a little easier to type but can only be used for groups with proper names. For example, only step_input["1"] can be used to access group 1 of step_input, because step_input.1 is not allowed.
I was thinking we should get rid of the dot syntax input.grp1 altogether, for several reasons:
sources?
ok.
To conclude our named input feature, the SoS equivalent of
# Run the same task on the two splits
t1 = Task1()
t2 = Task1()
t3 = Task2(
    in1 = t1.outspec('out1'),
    in2 = t1.outspec('out2'),
    in3 = t2.outspec('out1'),
    in4 = t2.outspec('out2'))
is
[t1]
output: out1='t1_out1', out2='t1_out2'
[t2]
output: out1='t2_out1', out2='t2_out2'
[t3]
input:
    in1 = output_from('t1')['out1'],
    in2 = output_from('t1')['out2'],
    in3 = output_from('t2')['out1'],
    in4 = output_from('t2')['out2']
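Inside step t3, the named inputs would then resolve as:
# _input['in1'] == 't1_out1', _input['in2'] == 't1_out2'
# _input['in3'] == 't2_out1', _input['in4'] == 't2_out2'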
Do we allow output_from to count backwards, eg output_from(-N) for N = 1, 2, 3, ...?
No. I had a look and found that it is difficult to implement, because a step does not know of the existence of previous steps. Also, -1 is not particularly clear in the case of
[step_10]
[step_20]
since output_from(-1) could be step_19 or step_10.
http://bionics.it/posts/workflows-dataflow-not-task-deps
So t3 gets inputs in1, in2, in3, in4 as step t1's out1 and out2 and step t2's out1 and out2, respectively. With what we have now (from_steps), we cannot differentiate individual outputs from steps t1 and t2.