hh1985 opened 6 years ago
I've actually just committed a change which may be relevant to this.
What is happening is that Bpipe thinks the last stage (`humann2`) is not declaring an output. In that case it automatically forwards the previous input as the default input for downstream stages to use, and here it incorrectly resolves it from the input to the overall parallel block rather than the input of the last stage. The commit fixes this problem. It would be really helpful if you could build from source off master and see whether that corrects the behavior.
I'm curious though: does `humann2` actually declare an output, or is it one of the prior stages creating the `.tsv` files?
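For reference, the forwarding behaviour described above can be sketched with a minimal hypothetical pipeline (the stage names and commands below are invented for illustration): a stage that declares no output causes Bpipe to forward its own input downstream, while a stage that declares an output via `transform` passes that output on instead.

```groovy
// Hypothetical stage names and commands, for illustration only.

// Declares no output: Bpipe forwards this stage's input (not whatever
// files the command happens to write) as the next stage's input.
silent_stage = {
    exec "some_tool $input"
}

// Declares its output: the next stage sees the .tsv file as its input.
tabulating_stage = {
    transform("tsv") {
        exec "some_tool $input > $output"
    }
}

Bpipe.run { silent_stage + tabulating_stage }
```

Inside a parallel block, the input forwarded by a stage like `silent_stage` should be that stage's own (branch-local) input; resolving it from the input to the whole parallel block is the bug the commit addresses.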
Hi @ssadedin,
The new code doesn't fix my problem :(
I tried to reproduce the error with simple code, and Bpipe worked fine, so the issue may be specific to my pipeline. My Bpipe code uses some custom library code (a .jar) that maps host paths to paths inside Docker, which might be causing the trouble.
In the following example,
```groovy
cutprimer = {
    requires outdir: "The directory for storing trimmed fastq"
    output.dir = output.dir + '/' + outdir
    def fprimer = REGISTER.locateParams('workflow', 'data').forward_primer
    def rprimer = REGISTER.locateParams('workflow', 'data').reverse_primer
    transform("*.fastq.gz") to(".cutP.fastq.gz") {
        def obj = configEnv(stageName)
        def cmd = "cutadapt -g $fprimer -G $rprimer -o ${obj.mapPathOut(file(output1).getAbsolutePath())} -p ${obj.mapPathOut(file(output2).getAbsolutePath())} ${REGISTER.rewireParams('stage', stageName)} ${obj.mapPathIn(input1 as String)} ${obj.mapPathIn(input2 as String)}"
        exec obj.run2(cmd)
    }
    println outputs
    forward outputs
}

adapter3 = {
    println inputs
    transform("*.cutP.fastq.gz") to(".txt") {
        println inputs
        exec "touch $outputs"
    }
}

Bpipe.run {
    "%_*.fastq.gz" * [cutprimer.using(outdir: "cutPrimer")] + adapter3
}
```
`outputs` in `cutprimer` are as expected (`.cutP.fastq.gz`). In stage `adapter3`, however, the first `println inputs` prints the raw `.fastq.gz` files, while the second (inside the `transform`) prints the `*.cutP.fastq.gz` files.
The log looks like:
```
bpipe.PipelineCategory [1] INFO |11:25:57 There [[id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357481, branch:hrk20180713-015-355-224, threadId:36, succeeded:true], [id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357482, branch:hrk20180713-015-131-634, threadId:37, succeeded:true], [id:null, stageName:cutprimer, startMs:1533828357385, endMs:1533828357482, branch:hrk20180713-015-141-926, threadId:38, succeeded:true]] parallel paths in final stage
bpipe.PipelineCategory [1] INFO |11:25:57 Last merged outputs are [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_2.cutP.fastq.gz]
bpipe.Utils [1] INFO |11:25:57 Setting output [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-355-224_S50_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-131-634_S48_L001_2.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_1.cutP.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/cutPrimer/hrk20180713-015-141-926_S45_L001_2.cutP.fastq.gz] on context 260084831 in thread 1
bpipe.PipelineCategory [1] INFO |11:25:57 Merged stage name is cutprimer_cutprimer_cutprimer_bpipe_merge
bpipe.PipelineStage [1] INFO |11:25:57 Stage 2 returned null as default inputs for next stage
bpipe.PipelineStage [1] INFO |11:25:57 Inputs are NOT being inferred from context.output (context.nextInputs=null)
bpipe.PipelineStage [1] INFO |11:25:57 Inferring nextInputs from inputs bpipe.PipelineContext@43f82e78.@input
bpipe.PipelineStage [1] INFO |11:25:57 No explicit output on stage 460570271 context 1140338296
bpipe.PipelineStage [1] INFO |11:25:57 Setting next inputs [/home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-355-224_S50_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-131-634_S48_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-141-926_S45_L001_2.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-141-926_S45_L001_1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-131-634_S48_L001_1.fastq.gz, /home/hanh/projects/xbiome_pipeline_16s/test/rmPrimer/rawdata/hrk20180713-015-355-224_S50_L001_1.fastq.gz] on stage 460570271, context 1140338296 in thread 1
```
I notice that the `id` is null in the first log entry. A normal, simple pipeline gives a log like:
```
bpipe.PipelineCategory [1] INFO |9:26:05 There [[id:0_0-0, stageName:step1, startMs:1533821165525, endMs:1533821165616, branch:abc, threadId:33, succeeded:true], [id:0_0-0, stageName:step1, startMs:1533821165525, endMs:1533821165616, branch:xyz, threadId:34, succeeded:true]] parallel paths in final stage
bpipe.PipelineCategory [1] INFO |9:26:05 Last merged outputs are [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt]
bpipe.Utils [1] INFO |9:26:05 Setting output [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt] on context 127702987 in thread 1
bpipe.PipelineCategory [1] INFO |9:26:05 Merged stage name is step1_step1_bpipe_merge
bpipe.PipelineStage [1] INFO |9:26:05 Inputs are NOT being inferred from context.output (context.nextInputs=[/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt])
bpipe.PipelineStage [1] INFO |9:26:05 No explicit output on stage 1884155890 context 237344028
bpipe.PipelineStage [1] INFO |9:26:05 Setting next inputs [/home/hanh/code-repository/bpipe_test/xyz/abc_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/abc_2.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_1.step1.txt, /home/hanh/code-repository/bpipe_test/xyz/xyz_2.step1.txt] on stage 1884155890, context 237344028 in thread 1
```
Hi @ssadedin ,
I used a Bpipe script to process fastq files, but got strange inputs for the final pooling step. The workflow looks like:

```groovy
check_input + "%_[rR]*.fastq.gz" * [ kneaddata + concatenate + humann2 ] + mergeMetaphlan
```
`humann2` produces `.tsv` files, so I assumed the inputs for stage `mergeMetaphlan` would be a collection of `.tsv` files. However, the `inputs` are resolved as `.fastq.gz` files (the files matched by `"%_[rR]*.fastq.gz"`). This is not the expected behavior. The log looks like:
I have to use `inputs.tsv` in `mergeMetaphlan` to force the inputs to be inferred as `.tsv` files. That workaround could be a problem if a parallelized stage X also produces `.fastq.gz` files and I want to combine all the output `.fastq.gz` files from the stage X branches. Any suggestions on the best practice for pooling results?
Thanks,
-hh1985