Closed reichan1998 closed 4 months ago
For this one since align_illumina_fastq
is just copying all of the steps (except for CRAM -> fastq conversion), it would be simpler to using a branch
operation to determine if the SAMTOOLS_FASTQ
step is needed (true for CRAM but false if input type is FASTQ).
For example you can see further on in align_short.nf
where there is a branch
operation to determine whether SAMTOOLS_MERGE
will be run:
// Collect all BWAMEM2 output by sample name
BWAMEM2_MEM.out.bam
| map { meta, bam -> [['id': meta.id.split('_')[0..-2].join('_'), 'datatype': meta.datatype], meta.read_count, bam] }
| groupTuple( by: [0] )
| map { meta, read_counts, bams -> [meta + [read_count: read_counts.sum()], bams] }
| branch {
meta, bams ->
single_bam: bams.size() == 1
multi_bams: true
}
| set { ch_bams }
// Merge, but only if there is more than 1 file
SAMTOOLS_MERGE ( ch_bams.multi_bams, [ [], [] ], [ [], [] ] )
ch_versions = ch_versions.mix ( SAMTOOLS_MERGE.out.versions.first() )
The branch
section of this is creating two branches: single_bam
if the sample has a single BAM file, or multi_bams
, which catches all cases where there is not a single BAM file:
| branch {
meta, bams ->
single_bam: bams.size() == 1
multi_bams: true
}
The next step is we run SAMTOOLS_MERGE
only on the multi_bams
branch, and then combine the outputs with the single_bam
branch. The process that you will need to implement for FASTQ vs CRAM inputs will be similar
Also note that ALIGN_SHORT
is actually used for both Illumina and HiC reads, so any changes in documentation for Illumina also applies to HiC:
include { ALIGN_SHORT as ALIGN_HIC } from '../subworkflows/local/align_short'
include { ALIGN_SHORT as ALIGN_ILLUMINA } from '../subworkflows/local/align_short'
Thank you very much for your instructions, @tkchafin. I noticed that ALIGN_SHORT
is used for both Illumina and HiC reads. Therefore, I decided to add a condition to branch reads with the fastq.gz
format in readmapping.nf
, so it will leave the ALIGN_HIC
unchanged. Your approach is certainly more efficient and structured, and I will adjust as you described.
I think we can allow the FASTQ option for both HiC and Illumina, so it might be best to keep the branching inside of align_short (the feature request actually was about HiC https://github.com/sanger-tol/readmapping/issues/88 )
I'm so sorry for the oversight and misunderstanding regarding the FASTQ option for HiC. I've made the changes as per your instructions . Thank you for your review and guidance.
Hi @tkchafin, I've made the changes following your instructions. Could you please help me review them? I really appreciate your guidance. Thank you!
Hi @reichan1998 thanks for these changes. I've been off sick the past few days, but will have a look soon.
I have added the file to the hosted test_data, the path is https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4%231.subset.fastq so you can put this into the assets/samplesheet_s3.csv for mMelMel2,illumina to test the pipeline with the fastq input option for that sample
To solve issue #88
PR checklist
nf-core lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).