pallassgj / bpipe

Automatically exported from code.google.com/p/bpipe

Please describe multiple file output #68

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I am trying to supply two files as input and process them together through 
multiple steps in the pipeline. In my case they are paired-end sequence reads, 
named something like "sampleA.1.fastq" and "sampleA.2.fastq".

I have tried the $inputs syntax, but it appears to work only in the first 
step. $output is always a single filename (at least for me), so subsequent 
steps receive only one $input file.

I also tried the produce() statement to force the output to contain two 
filenames, but it is not clear to me how to use it, and I have not been able 
to get a simple test case to work.

Can you describe how you would approach this problem? I have these steps:
-- trim reads (the trimming step uses both reads per pair)
-- subset reads (take only 'n' entries from each file)
-- map the paired reads together in the same bowtie command

The question is broader than that, though: how would I, for example, create 
two stranded coverage bedGraphs, one for '+' and one for '-'? That would 
produce two output files that I would want to keep linked together.

As for environment variable scope, it appears that environment variables set 
in one stage of a pipeline are not available to subsequent stages. Is that 
true? I thought I could pass filenames from one step to the next that way 
(taking care to avoid parallel processing clashes), but it was problematic.
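
To show what I mean, here is a minimal Python sketch of the behaviour I would expect if each stage's command runs in its own shell process. That is an assumption on my part about how commands are launched, not something I have confirmed for Bpipe. The helper `run_stage` and the variable name `BPIPE_DEMO_TRIMMED` are made up for this illustration.

```python
import subprocess

def run_stage(command: str) -> str:
    """Run one stage's command in its own `sh` process and return its stdout."""
    result = subprocess.run(
        ["sh", "-c", command],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# Stage 1 exports a variable; it exists only inside that shell process.
stage1 = run_stage(
    'export BPIPE_DEMO_TRIMMED=sampleA.1.trim.fastq; echo "$BPIPE_DEMO_TRIMMED"'
)

# Stage 2 is a separate process, so the variable is no longer set there.
stage2 = run_stage('echo "${BPIPE_DEMO_TRIMMED:-unset}"')

print(stage1)
print(stage2)
```

If that assumption holds, the second stage never sees the export, which would explain why handing filenames along through the environment did not work for me.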

Thank you for any insights!

Original issue reported on code.google.com by jmw86...@gmail.com on 10 Jan 2013 at 10:03

GoogleCodeExporter commented 9 years ago
Update: I think I now have a better handle on how to deal with multiple files 
at input and output. The key step is to use filter("outputExtension1", 
"outputExtension2") to create two output files with file extensions that can 
be recognized downstream. The only issue is aesthetic: both outputs take 
their name from the first input, so the files are named 
"filename1.outputExtension1" and "filename1.outputExtension2" instead of 
"filename1.outputExtension1" and "filename2.outputExtension2".
If I then ran another similar 2-file filter, the new filenames would be 
"filename1.outputExtension1.newExtension1" and 
"filename1.outputExtension1.newExtension2". That rather spoils the use of 
filenames as an audit trail, but at least I understand it.
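
As a rough illustration, this Python sketch models only the naming I observed, where every output is derived from the first input. It is not Bpipe's implementation, and the function name `observed_names` is made up for this example.

```python
def observed_names(inputs, extensions):
    """Model the observed naming: every output is built from the FIRST input."""
    first = inputs[0]
    return [f"{first}.{ext}" for ext in extensions]

# First 2-file filter step.
step1 = observed_names(
    ["filename1", "filename2"],
    ["outputExtension1", "outputExtension2"],
)

# Second 2-file filter step, fed with the outputs of the first.
step2 = observed_names(step1, ["newExtension1", "newExtension2"])

print(step1)
print(step2)
```

After the second step both names start with "filename1.outputExtension1", so the second file's own history is no longer visible in its name.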

One suggestion is to repeat the input names 'n' times, and repeat the subset 
suffix 'n' times, then paste them together, where 'n' is the number of output 
files requested.
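
Here is a small Python sketch of how I read that suggestion, assuming "repeat" means recycling each list to length 'n' and "paste" means joining the two lists element by element. The names `recycle` and `paired_names` are hypothetical and only serve this illustration.

```python
from itertools import cycle, islice

def recycle(values, n):
    """Repeat `values` as often as needed to obtain exactly n items."""
    return list(islice(cycle(values), n))

def paired_names(inputs, suffixes, n):
    """Join the i-th recycled input name with the i-th recycled suffix."""
    return [
        f"{name}.{suffix}"
        for name, suffix in zip(recycle(inputs, n), recycle(suffixes, n))
    ]

# Two inputs, two suffixes: each output keeps its own input name.
two_suffixes = paired_names(
    ["filename1", "filename2"],
    ["outputExtension1", "outputExtension2"],
    2,
)

# Two inputs, one suffix: the suffix is recycled across both files.
one_suffix = paired_names(["filename1", "filename2"], ["trim"], 2)

print(two_suffixes)
print(one_suffix)
```

With this scheme each output keeps the name of the input it came from, so the filename audit trail survives across several 2-file steps.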

I'm not clear exactly how things work yet, so apologies for posting an issue 
here rather than a discussion thread. I can move there if you prefer. :-)

Original comment by jmw86...@gmail.com on 14 Jan 2013 at 6:13