jigneshvasoya / ruffus

Automatically exported from code.google.com/p/ruffus
MIT License
0 stars 0 forks source link

Run split in parallell #51

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
It would be very useful if there existed way to split files and run them in 
parallel. In the simplest case this would be a 1 -> many operation, with each 
output file being passed in a separate call to the wrapped function. @split 
cannot do this as it passes the entire list of output files into one call.

There are several ways I have found to do this in ruffus by using multiple 
steps, but it would be greater to have another wrapper function to explicitly 
handle this. 

To be clear what I mean the following two step procedure accomplished what I 
want without creating extra files is as follows.

@split(['a.txt', 'b.txt'], regex('(.*)\.txt'), [r'\1.{0}.tsv'.format(i) for i 
in range(10)])
def dummy_task(in_file, out_file):
    pass

@transform(dummy_task, regex('(.*)\.(\d+)\.tsv'), inputs(r'\1.txt'), 
r'\1.\2.tsv')
def parallalel_split(in_file, out_file):
    pass

Thanks,
Andy

Original issue reported on code.google.com by AndrewJL...@gmail.com on 17 Nov 2012 at 1:49

GoogleCodeExporter commented 9 years ago
dummy_task is called twice, once with in_file = "a.txt" and once with in_file = 
"b.txt"

Do you mean that you would like to have 20 calls to dummy_task:

dummy_task("a.txt", "a.1.txt")
dummy_task("a.txt", "a.2.txt")
dummy_task("a.txt", "a.3.txt") 
...
dummy_task("b.txt", "b.1.txt")
dummy_task("b.txt", "b.2.txt")
...

====================
Custom parameters
====================
You can do this by generating the parameters on the fly with @files
See http://www.ruffus.org.uk/decorators/files_ex.html

So:
import re
def custom_param():
    re.sub('(.*)\.txt', 
    for in_file in ['a.txt', 'b.txt']:
        for ii in range(10):
            outfile_pattern = r'\1.{0}.tsv'.format(ii)
            yield infile, re.sub('(.*)\.txt', outfile_pattern, in_file)

@files(custom_param)
def dummy_task(infile, outfile):
   pass

This is maximally flexible if you don't know that each "a.txt" and "b.txt" will 
be split into 10 until run time. custom_param() is a python function, and you 
can do whatever you want, however you want. If you need to store state, either 
use a python callable object or a closure like approach.

Andreas Heger has suggested / is contributing permutation/combination 
decorators to Ruffus which may do exactly what you want without a custom 
function but we still need to nail down the details.

Leo

Original comment by bunbu...@gmail.com on 4 Jan 2013 at 5:08

GoogleCodeExporter commented 9 years ago
Hi Leo,
Thanks for responding. 

To clarify with a simpler example I would like something like the following to 
run each output argument in parallel.

@split('a.txt', regex('(.*)\.txt'), [r'\1.{0}.txt'.format(i) for i in 
range(10)])
def my_func(in_file, out_file):
    pass

As written using split you would make a single call to my_func with arguments 
in_file='a.txt' and out_file=['a.0.txt', ..., 'a.9.txt']. 

What I want is a decroator to make multiple calls, i.e. my_func('a.txt', 
'a.0.txt'), ..., my_func('a.txt', 'a.9.txt') in parallel so the jobs can run 
concurrently. 

With regards to the suggested approach, using @files and custom_param() works 
if you now the inputs will be 'a.txt', 'b.txt'. How would that approach work if 
the input names were variable?

Cheers,
Andy

Original comment by AndrewJL...@gmail.com on 4 Jan 2013 at 6:38

GoogleCodeExporter commented 9 years ago
Just to simplify the original example since the two inputs was not what I was 
interested in. 

Suppose I just want to use 'a.txt' as an input and generate 'a.0.tsv', ..., 
'a.9.tsv'. That is I want to make 10 calls to parallalel_split which will be 
run concurrently, parallalel_split('a.txt', 'a.0.tsv'), ..., 
parallalel_split('a.txt', 'a.9.tsv').

The simpler code is

@split('a.txt', regex('(.*)\.txt'), [r'\1.{0}.tsv'.format(i) for i in 
range(10)])
def dummy_task(in_file, out_file):
    pass

@transform(dummy_task, regex('(.*)\.(\d+)\.tsv'), inputs(r'\1.txt'), 
r'\1.\2.tsv')
def parallalel_split(in_file, out_file):
    pass

Original comment by AndrewJL...@gmail.com on 4 Jan 2013 at 6:42

GoogleCodeExporter commented 9 years ago
The only way at the moment is to either generate the input / output files on 
the fly (as per comment 2) or wait for the Andreas' contribution. This will 
probably look like:
@all_vs_all(range(10), 'a.txt', [regex(r'(.*)\.txt'), regex(r'\d+')], 
r'\1.\2.tsv'):
    pass

The syntax is still undecided.

I don't think that extending @split syntax like you suggest is a good idea:
@split is already a bit too overloaded. I am rewriting the documentation to 
reflect that we have 1->1 or 1->many or many->"even more" etc. operations in 
Ruffus. And you can tell by counting the number of jobs upstream and downstream 
of a particular task.

@split always is a 1->many (vanilla syntax) or many->"even more" (with regex) 
operation. This is already too overloaded.

@transform is always a 1->1 operation.

You actual are asking for a 1->1 operation like transform but with special 
syntax to generate the input and output parameters in some combinatorial 
pattern.

Hope that helps.

Leo

Original comment by bunbu...@gmail.com on 4 Jan 2013 at 6:53

GoogleCodeExporter commented 9 years ago
Thanks Leo. The code I posted in comment 3 does exactly what I need, but I was 
just hoping for a more elegant solution.

I agree split should not behave that way, I had in mind a new decorator for the 
task with a syntax like split. 

@parallel_split('a.txt', regex('(.*)\.txt'), [r'\1.{0}.tsv'.format(i) for i in 
range(10)])

This would appear to the user a 1->many operation, though under the hood it 
would be actually making numerous 1->1 calls in parallel. I will keep an eye 
out for Andreas' work as it sound close to what I want.

Cheers,
Andy

Original comment by AndrewJL...@gmail.com on 4 Jan 2013 at 7:01

GoogleCodeExporter commented 9 years ago

Original comment by bunbu...@gmail.com on 14 May 2014 at 10:32