Open lonsbio opened 10 years ago
From ssade...@gmail.com on 2012-07-11T16:59:37Z
Thanks for logging this. I agree it's something that needs to be addressed, as it goes to the issue of reproducibility. I will see what can be done in small ways to address it in the short term, and think about more comprehensive solutions in the longer term. (I am actually thinking the 'bpipe' command itself might be part of this, in the sense that it could be used to query the logs for any stage of any run, such as 'bpipe log
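A rough sketch of that kind of query interface, for illustration only: plain 'bpipe log' exists as a command, but the per-stage and per-branch arguments shown here are hypothetical, not current bpipe flags.

    # Existing: show the output log of the most recent run
    bpipe log

    # Imagined extensions sketched in the comment above (not real flags):
    bpipe log align                      # output of the 'align' stage only
    bpipe log -branch sample_43 dedup    # output of 'dedup' within one parallel branch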
Labels: -Priority-Medium Priority-High
From NextGen...@gmail.com on 2012-10-08T18:21:38Z
I would like to agree with the initial poster. For reproducibility as well as traceability and archival purposes, a "one log per input" would be quite essential. I would, for example, like to integrate these files into our LIMS to trace the processing that an individual file, i.e. an individual sample, has gone through. Combining that with the version tracking integrated into bpipe would be quite powerful.
From andrew.o...@gmail.com on 2012-07-10T07:53:45Z
What steps will reproduce the problem?
1. Run a pipeline with parallel commands.
[16:18:02] Modeling fragment count overdispersion.
[16:18:02] Calculating initial abundance estimates for bias correction. <----- [From Cufflinks]
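For concreteness, a minimal sketch of the reproduction; the pipeline file name and inputs are assumed here, not taken from the report:

    # Run a pipeline whose stages execute commands in parallel; all of their
    # stdout/stderr currently ends up interleaved in one shared run log.
    bpipe run pipeline.groovy *.bam

    # Inspect the combined log of the run -- messages from the parallel
    # commands, like the Cufflinks lines above, appear mixed together.
    bpipe log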
There are many situations where this is problematic. For example, if a particular application gives similar output to another application running in parallel, it is impossible to distinguish them. Or, if multiple commands of the same application are running, it looks something like this:
Calculating read count stats... 444621 Queried Treatment Observations Sample_43_CNT_24hr.bam 313186 Queried Control Observations Sample_58_Hep_cnt_72hr.bam
Calculating negative binomial p-values and FDRs in R using DESeq (http://www-huber.embl.de/users/anders/DESeq/)... chr19 chr5_random chr18
Calculating read count stats... 396066 Queried Treatment Observations Sample_43_CNT_24hr.bam 279005 Queried Control Observations Sample_58_Hep_cnt_72hr.bam
Calculating negative binomial p-values and FDRs in R using DESeq (http://www-huber.embl.de/users/anders/DESeq/)... chr5_random chr18 chrM
Calculating read count stats... 445143 Queried Treatment Observations Sample_43_CNT_24hr.bam 308133 Queried Control Observations Sample_60_3T3_cnt_72hr.bam
There is no way to tell which initial command produced which particular output stats (except that in this case the application was kind enough to include the input file names, but not all applications will do this). Likewise, if multiple commands are running in parallel and one reports an error, it is difficult to know which command caused the error.
Ideally, the output could be saved in separate files for each command that is issued; then errors, etc. can be associated with their originating commands.

What version of the product are you using? On what operating system?
bpipe-0.9.5, Linux

Please provide any additional information below.
Maybe one way to do this would be to have a "command id" in the commandlog.txt file, after each command. Then within the .bpipe directory, you could have a directory containing files named <commandid>.something, each containing 1) the present working directory, 2) the command, and 3) the stderr and stdout for the command. Furthermore, you could use this command id as the job name for submissions to the cluster, so that when we get an email notification of an error (e.g., non-zero exit status), we can go directly to that command's output file to see the full screen output/error message.
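As a rough sketch of the layout being proposed here, the directory names, the "cmd_17" id, the FULL_CMD example, and the wrapper itself are all illustrative assumptions, not an existing bpipe feature:

    # Hypothetical per-command logging scheme, assuming bpipe assigned each
    # command an id such as "cmd_17" and recorded it in commandlog.txt.
    CMD_ID="cmd_17"
    CMD_DIR=".bpipe/commands/$CMD_ID"
    FULL_CMD="cufflinks -o out_dir sample.bam"   # example command, illustrative only
    mkdir -p "$CMD_DIR"

    pwd              > "$CMD_DIR/pwd"       # 1) working directory of the command
    echo "$FULL_CMD" > "$CMD_DIR/command"   # 2) the command line that was run
    eval "$FULL_CMD" > "$CMD_DIR/stdout" 2> "$CMD_DIR/stderr"   # 3) captured output

    # The same id could be reused as the cluster job name, e.g.:
    #   qsub -N "$CMD_ID" job_script.sh
    # so an email about a failed job points straight at $CMD_DIR/stderr.

With a layout like this, the email notification only needs to carry the command id to let you locate the full screen output for the failing command.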
Original issue: http://code.google.com/p/bpipe/issues/detail?id=42