pallassgj / bpipe

Automatically exported from code.google.com/p/bpipe

Merged job logging of parallel commands. #42

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1.  Run a pipeline with parallel commands.
2.  Look at the screen log output.

What is the expected output? What do you see instead?

I expect the output of each command to be captured in its own file somewhere in the 
.bpipe directory.  Instead, I see only one log file for the entire run, containing 
the output of all of the commands.  This is fine for sequential pipelines, but it is 
a problem for pipelines with jobs running in parallel.  For example, in a pipeline 
that runs Cufflinks and OverdispersedRegionScanSeqs (from the USeq package) in 
parallel, I see this kind of interleaving when the outputs intersect (a sketch of 
such a pipeline follows the excerpt):

[16:18:02] Modeling fragment count overdispersion.
[16:18:02] Calculating initial abundance estimates for bias correction.    <----- [From Cufflinks]
> Processed 16024 loci.                        [*************************] 100%
chr17random chr14random chrY chrX    <----- [From OverdispersedRegionScanSeqs]
[16:38:19] Learning bias parameters.    <----- [From Cufflinks]
 chr16 chr19 chr5_random chr18    <----- [From OverdispersedRegionScanSeqs]
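
The pipeline producing this was essentially of the following shape (a minimal 
sketch with stand-in stage bodies, not the actual command lines; bpipe runs the 
stages listed inside square brackets in parallel):

    // Hypothetical stage definitions; the real commands are the
    // Cufflinks and USeq invocations described above.
    cufflinks = {
        exec "cufflinks $input.bam"
    }
    scan_seqs = {
        exec "scan_regions.sh $input.bam"  // stand-in for OverdispersedRegionScanSeqs
    }
    run { [ cufflinks, scan_seqs ] }

Both branches write to the same screen log, so their output interleaves as above.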

There are many situations where this is problematic.  For example, if one 
application produces output similar to that of another application running in 
parallel, it is impossible to tell them apart.  Or, if multiple commands of the 
same application are running, it looks something like this:
Calculating read count stats...
    444621 Queried Treatment Observations
        Sample_43_CNT_24hr.bam
    313186 Queried Control Observations
        Sample_58_Hep_cnt_72hr.bam

Calculating negative binomial p-values and FDRs in R using DESeq 
(http://www-huber.embl.de/users/anders/DESeq/)...
 chr19 chr5_random chr18

Calculating read count stats...
    396066 Queried Treatment Observations
        Sample_43_CNT_24hr.bam
    279005 Queried Control Observations
        Sample_58_Hep_cnt_72hr.bam

Calculating negative binomial p-values and FDRs in R using DESeq 
(http://www-huber.embl.de/users/anders/DESeq/)...
 chr5_random chr18 chrM

Calculating read count stats...
    445143 Queried Treatment Observations
        Sample_43_CNT_24hr.bam
    308133 Queried Control Observations
        Sample_60_3T3_cnt_72hr.bam

There is no way to tell which command produced which set of output stats (in this 
case the application was kind enough to include the input file names, but not all 
applications will do this).  Likewise, if multiple commands are running in 
parallel and one reports an error, it is difficult to know which command caused it.

Ideally, the output would be saved in a separate file for each command that is 
issued, so that errors, etc. could be associated with the command that produced them.

What version of the product are you using? On what operating system?
bpipe-0.9.5, Linux 

Please provide any additional information below.

Maybe one way to do this would be to record a "command id" in the commandlog.txt 
file after each command.  Then, within the .bpipe directory, there could be a 
directory containing files named <commandid>.something that hold 1) the working 
directory, 2) the command itself, and 3) the stderr and stdout for the command.  
Furthermore, this command id could be used as the job name for cluster 
submissions, so that when we get an email notification of an error (e.g., a 
non-zero exit status), we can go directly to that command's output file to see 
the full screen output/error message.
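
To sketch what this might look like (the ids, paths, and layout here are purely 
illustrative, not anything bpipe currently does):

    commandlog.txt:
        # command 17
        cufflinks sample_43.bam

    .bpipe/commands/17.txt:
        cwd:     /data/project1
        command: cufflinks sample_43.bam
        <stdout and stderr of the command>

Submitting that command to the cluster under the job name "17" would then let an 
error notification email point straight at .bpipe/commands/17.txt.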

Original issue reported on code.google.com by andrew.o...@gmail.com on 9 Jul 2012 at 9:53

GoogleCodeExporter commented 9 years ago
Thanks for logging this.  I agree it's something that needs to be addressed, as 
it goes to the issue of reproducibility.  I will see what can be done in small 
ways to address it in the short term, and think about more comprehensive 
solutions in the longer term.  (I am actually thinking the 'bpipe' command itself 
might be part of this, in the sense that it could be used to query the logs for 
any stage of any run, such as 'bpipe log <runid> <commandid>', or 'bpipe log 
<output file>' to show the log for the command that produced the file - but that 
needs bigger internal changes.)
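
For example (hypothetical invocations - the exact interface would need those 
bigger changes first):

    bpipe log 3 17            # log for command 17 of run 3
    bpipe log sample_43.bam   # log for the command that produced this file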

Original comment by ssade...@gmail.com on 11 Jul 2012 at 11:59

GoogleCodeExporter commented 9 years ago
I would like to agree with the initial poster.  For reproducibility as well as 
traceability and archival purposes, a "one log per input" scheme would be quite 
essential.  I would, for example, like to integrate these files into our LIMS to 
trace the processing that an individual file, i.e. an individual sample, has gone 
through.  Combining that with the version tracking integrated in bpipe would be 
quite powerful.

Original comment by NextGen...@gmail.com on 9 Oct 2012 at 1:21