Open GoogleCodeExporter opened 9 years ago
dummy_task is called twice, once with in_file = "a.txt" and once with in_file =
"b.txt"
Do you mean that you would like to have 20 calls to dummy_task:
dummy_task("a.txt", "a.1.txt")
dummy_task("a.txt", "a.2.txt")
dummy_task("a.txt", "a.3.txt")
...
dummy_task("b.txt", "b.1.txt")
dummy_task("b.txt", "b.2.txt")
...
====================
Custom parameters
====================
You can do this by generating the parameters on the fly with @files
See http://www.ruffus.org.uk/decorators/files_ex.html
So:
import re
def custom_param():
re.sub('(.*)\.txt',
for in_file in ['a.txt', 'b.txt']:
for ii in range(10):
outfile_pattern = r'\1.{0}.tsv'.format(ii)
yield infile, re.sub('(.*)\.txt', outfile_pattern, in_file)
@files(custom_param)
def dummy_task(infile, outfile):
pass
This is maximally flexible if you don't know that each "a.txt" and "b.txt" will
be split into 10 until run time. custom_param() is a python function, and you
can do whatever you want, however you want. If you need to store state, either
use a python callable object or a closure like approach.
Andreas Heger has suggested / is contributing permutation/combination
decorators to Ruffus which may do exactly what you want without a custom
function but we still need to nail down the details.
Leo
Original comment by bunbu...@gmail.com
on 4 Jan 2013 at 5:08
Hi Leo,
Thanks for responding.
To clarify with a simpler example I would like something like the following to
run each output argument in parallel.
@split('a.txt', regex('(.*)\.txt'), [r'\1.{0}.txt'.format(i) for i in
range(10)])
def my_func(in_file, out_file):
pass
As written using split you would make a single call to my_func with arguments
in_file='a.txt' and out_file=['a.0.txt', ..., 'a.9.txt'].
What I want is a decroator to make multiple calls, i.e. my_func('a.txt',
'a.0.txt'), ..., my_func('a.txt', 'a.9.txt') in parallel so the jobs can run
concurrently.
With regards to the suggested approach, using @files and custom_param() works
if you now the inputs will be 'a.txt', 'b.txt'. How would that approach work if
the input names were variable?
Cheers,
Andy
Original comment by AndrewJL...@gmail.com
on 4 Jan 2013 at 6:38
Just to simplify the original example since the two inputs was not what I was
interested in.
Suppose I just want to use 'a.txt' as an input and generate 'a.0.tsv', ...,
'a.9.tsv'. That is I want to make 10 calls to parallalel_split which will be
run concurrently, parallalel_split('a.txt', 'a.0.tsv'), ...,
parallalel_split('a.txt', 'a.9.tsv').
The simpler code is
@split('a.txt', regex('(.*)\.txt'), [r'\1.{0}.tsv'.format(i) for i in
range(10)])
def dummy_task(in_file, out_file):
pass
@transform(dummy_task, regex('(.*)\.(\d+)\.tsv'), inputs(r'\1.txt'),
r'\1.\2.tsv')
def parallalel_split(in_file, out_file):
pass
Original comment by AndrewJL...@gmail.com
on 4 Jan 2013 at 6:42
The only way at the moment is to either generate the input / output files on
the fly (as per comment 2) or wait for the Andreas' contribution. This will
probably look like:
@all_vs_all(range(10), 'a.txt', [regex(r'(.*)\.txt'), regex(r'\d+')],
r'\1.\2.tsv'):
pass
The syntax is still undecided.
I don't think that extending @split syntax like you suggest is a good idea:
@split is already a bit too overloaded. I am rewriting the documentation to
reflect that we have 1->1 or 1->many or many->"even more" etc. operations in
Ruffus. And you can tell by counting the number of jobs upstream and downstream
of a particular task.
@split always is a 1->many (vanilla syntax) or many->"even more" (with regex)
operation. This is already too overloaded.
@transform is always a 1->1 operation.
You actual are asking for a 1->1 operation like transform but with special
syntax to generate the input and output parameters in some combinatorial
pattern.
Hope that helps.
Leo
Original comment by bunbu...@gmail.com
on 4 Jan 2013 at 6:53
Thanks Leo. The code I posted in comment 3 does exactly what I need, but I was
just hoping for a more elegant solution.
I agree split should not behave that way, I had in mind a new decorator for the
task with a syntax like split.
@parallel_split('a.txt', regex('(.*)\.txt'), [r'\1.{0}.tsv'.format(i) for i in
range(10)])
This would appear to the user a 1->many operation, though under the hood it
would be actually making numerous 1->1 calls in parallel. I will keep an eye
out for Andreas' work as it sound close to what I want.
Cheers,
Andy
Original comment by AndrewJL...@gmail.com
on 4 Jan 2013 at 7:01
Original comment by bunbu...@gmail.com
on 14 May 2014 at 10:32
Original issue reported on code.google.com by
AndrewJL...@gmail.com
on 17 Nov 2012 at 1:49