hh1985 opened 6 years ago
I've actually just committed a change which may be relevant to this.
What is happening is that Bpipe thinks the last stage (`humann2`) is not declaring an output. In that case it automatically forwards the previous input as the default input for downstream stages to use, and here it incorrectly resolves it from the input to the overall parallel block rather than the input of the last stage. The commit fixes this problem. It would be really helpful if you could build from source off master and see whether that corrects the behavior.
I'm curious though: does `humann2` actually declare an output, or is it one of the prior stages creating the `.tsv` files?
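For reference, the forwarding behaviour described above can be sketched with a minimal hypothetical pipeline (the stage names and commands below are invented for illustration): a stage that declares no output causes Bpipe to forward its own input downstream, while a stage that declares an output via `transform` passes that output on instead.

```groovy
// Hypothetical stage names and commands, for illustration only.

// Declares no output: Bpipe forwards this stage's input (not whatever
// files the command happens to write) as the next stage's input.
silent_stage = {
    exec "some_tool $input"
}

// Declares its output: the next stage sees the .tsv file as its input.
tabulating_stage = {
    transform("tsv") {
        exec "some_tool $input > $output"
    }
}

Bpipe.run { silent_stage + tabulating_stage }
```

Inside a parallel block, the input forwarded by a stage like `silent_stage` should be that stage's own (branch-local) input; resolving it from the input to the whole parallel block is the bug the commit addresses.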
Hi @ssadedin,
The new code doesn't fix my problem :(
I tried to reproduce the error with simple code, and Bpipe worked fine, so the issue may be specific to my pipeline. My Bpipe code uses some custom library code (a .jar) that maps host paths to paths inside Docker, which might be causing the trouble.
In the following example,
```groovy
cutprimer = {
    requires outdir: "The directory for storing trimmed fastq"
    output.dir = output.dir + '/' + outdir
    def fprimer = REGISTER.locateParams('workflow', 'data').forward_primer
    def rprimer = REGISTER.locateParams('workflow', 'data').reverse_primer
    transform("*.fastq.gz") to(".cutP.fastq.gz") {
        def obj = configEnv(stageName)
        def cmd = "cutadapt -g $fprimer -G $rprimer -o ${obj.mapPathOut(file(output1).getAbsolutePath())} -p ${obj.mapPathOut(file(output2).getAbsolutePath())} ${REGISTER.rewireParams('stage', stageName)} ${obj.mapPathIn(input1 as String)} ${obj.mapPathIn(input2 as String)}"
        exec obj.run2(cmd)
    }
    println outputs
    forward outputs
}

adapter3 = {
    println inputs
    transform("*.cutP.fastq.gz") to(".txt") {
        println inputs
        exec "touch $outputs"
    }
}

Bpipe.run {
    "%_*.fastq.gz" * [cutprimer.using(outdir: "cutPrimer")] + adapter3
}
```
`outputs` in `cutprimer` are as expected (`.cutP.fastq.gz`). In stage `adapter3`, however, the first `println inputs` prints the raw `.fastq.gz` files, while the second (inside the `transform`) prints the `*.cutP.fastq.gz` files.
The log looks like:
```
bpipe.PipelineCategory [1] INFO |11:25:57 There [[id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357481, branch:hrk20180713-015-355-224, threadId:36, succeeded:true], [id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357482, branch:hrk20180713-015-131-634, threadId:37, succeeded:true], [id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357482, branch:hrk20180713-015-141-926, threadId:38, succeeded:true]] parallel paths in final stage
bpipe.PipelineCategory [1] INFO |11:25:57 Last merged outputs are [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_2.cutP.fastq.gz]
bpipe.Utils [1] INFO |11:25:57 Setting output [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_2.cutP.fastq.gz] on context 260084831 in thread 1
bpipe.PipelineCategory [1] INFO |11:25:57 Merged stage name is cutprimer_cutprimer_cutprimer_bpipe_merge
bpipe.PipelineStage [1] INFO |11:25:57 Stage 2 returned null as default inputs for next stage
bpipe.PipelineStage [1] INFO |11:25:57 Inputs are NOT being inferred from context.output (context.nextInputs=null)
bpipe.PipelineStage [1] INFO |11:25:57 Inferring nextInputs from inputs bpipe.PipelineContext@43f82e78.@input
bpipe.PipelineStage [1] INFO |11:25:57 No explicit output on stage 460570271 context 1140338296
bpipe.PipelineStage [1] INFO |11:25:57 Setting next inputs [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-355-224_S50_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-131-634_S48_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-141-926_S45_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-141-926_S45_L001_1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-131-634_S48_L001_1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-355-224_S50_L001_1.fastq.gz] on stage 460570271, context 1140338296 in thread 1
```
I notice that the `id` is null in the first log entry. A normal, simple pipeline gives a log like:
```
bpipe.PipelineCategory [1] INFO |9:26:05 There [[id:0_0-0, stageName:step1, startMs:1533821165525, endMs:1533821165616, branch:abc, threadId:33, succeeded:true], [id:0_0-0, stageName:step1, startMs:1533821165525, endMs:1533821165616, branch:xyz, threadId:34, succeeded:true]] parallel paths in final stage
bpipe.PipelineCategory [1] INFO |9:26:05 Last merged outputs are [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt]
bpipe.Utils [1] INFO |9:26:05 Setting output [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt] on context 127702987 in thread 1
bpipe.PipelineCategory [1] INFO |9:26:05 Merged stage name is step1_step1_bpipe_merge
bpipe.PipelineStage [1] INFO |9:26:05 Inputs are NOT being inferred from context.output (context.nextInputs=[/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt])
bpipe.PipelineStage [1] INFO |9:26:05 No explicit output on stage 1884155890 context 237344028
bpipe.PipelineStage [1] INFO |9:26:05 Setting next inputs [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt] on stage 1884155890, context 237344028 in thread 1
```
Hi @ssadedin ,
I used a Bpipe script to process fastq files, but got strange inputs for the final pooling step. The workflow looks like:

```groovy
check_input + "%_[rR]*.fastq.gz" * [ kneaddata + concatenate + humann2 ] + mergeMetaphlan
```
`humann2` produces `.tsv` files, so I assumed the inputs for stage `mergeMetaphlan` would be a collection of `.tsv` files. However, the `inputs` are resolved as `.fastq.gz` files (the files matched by `"%_[rR]*.fastq.gz"`). This is not the expected behavior. The log looks like:
I have to use `inputs.tsv` in `mergeMetaphlan` to force the inputs to be inferred as `.tsv` files. That workaround could be a problem if a parallelized stage X also produces `.fastq.gz` files and I want to combine all the output `.fastq.gz` files from the stage X branches. Any suggestions on the best practice for pooling results?
Thanks,
-hh1985