Closed tkchafin closed 2 months ago
@tkchafin . Talk to Yumi and Ksenia. The question has come up multiple times in the pipelines meeting between them and Priyanka. Every time, the conclusion was "it's actually all fine", but please let's confirm again and record the reason as a comment somewhere in the code :)
Ok, will put together some comparisons of read numbers and request a code review from treeval folks if it doesn't come out making sense
The Treeval bwa-mem2 configuration is specifically designed for the cram_filter module. We strictly keep only primary reads and pass them to bwa-mem2. The -5SPCp flags specify that the expected input reads are in interleaved paired end format and ensure that no reads are ignored (e.g., read1, read2, read1, read2). These settings are crucial for ensuring that each I/O operation functions correctly. We can discuss this further during the pipeline meeting if needed. @reichan1998 I don't understand why you set interleave=False for hic mapping as this will produce two fastq files, and when you using bwa-mem2 with -p, you will miss half of the read for sure, could you explain what you are trying to do please, and could you provide more information.
I have checked the pipelines, both ALIGN_ILLUMINA and ALIGN_HIC would work if your input is cram file. And If you use fastq file as input then ALIGN_ILLUMINA expects two fastq files (read1 and read2) while ALIGN_HIC expects a single interleaved fastq input.
@muffato Do you think we should make this a bit more clear and more consistent for the two subworkflows.
Yes Treeval does the IO with -p
correctly. In readmapping, we convert the CRAM to FASTQ first using the separate module:
// Convert from CRAM to FASTQ only if CRAM files were provided as input
SAMTOOLS_FASTQ ( ch_reads.cram, false )
ch_versions = ch_versions.mix ( SAMTOOLS_FASTQ.out.versions.first() )
SAMTOOLS_FASTQ.out.fastq
| mix ( ch_reads.fastq )
| set { ch_reads_fastq }
// Align Fastq to Genome and output sorted BAM
BWAMEM2_MEM ( ch_reads_fastq, index, fasta, true )
ch_versions = ch_versions.mix ( BWAMEM2_MEM.out.versions.first() )
which isn't creating interleaved output: https://github.com/sanger-tol/readmapping/blob/main/modules/nf-core/samtools/fastq/main.nf
However, in that case we are still calling bwa-mem with -p
(only for HiC):
withName: '.*:.*:ALIGN_HIC:BWAMEM2_MEM' {
ext.args = { "-5SPCp -R ${meta.read_group}" }
}
So this makes a call like this in the case of HiC CRAM input:
bwa-mem2 \
mem \
-5SPCp -R '@RG\tID:35528_2%231\tPL:ILLUMINA\tSM:mMelMel3' \
-t 2 \
$INDEX \
mMelMel3_T1_1.fastq.gz mMelMel3_T1_2.fastq.gz\
| samtools sort -@ 2 -o mMelMel3_T1.bam -
This wa discovered by Chau. Note we are passing read1 and read2 but also using -p
. For Illumina CRAM inputs we handle it correctly because we don't set -p
, but with HiC CRAM inputs I think we must be dumping the read2 files (only in readmapping, not in treeval)
two options: (1) remove the -p from ALIGN_HIC. (2) add -p to ALIGN_ILLUMINA and set interleave as true.
Making -p as flexible argument can be also good in this case.
Yes I just wanted to have a second pair of eyes confirm that this was wrong and I wasn't overlooking something. It looks it would have been introduced in 1.1.0:
interleave=False: https://github.com/sanger-tol/readmapping/blob/1.1.0/subworkflows/local/align_short.nf bwa-mem with -p: https://github.com/sanger-tol/readmapping/blob/1.1.0/conf/modules.config
The issue would only apply to HiC CRAM inputs
Did a quick verification, with 1.2.2 the CRAM hic input mapped is:
(nf-core) tc25@mib119777s hic % samtools view -c GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
10125
With the latest PR we get:
(nf-core) tc25@mib119777s hic % samtools view -c GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
20331
Both with bwa-mem2. So I will mark this as resolved. Thanks @reichan1998 for finding this issue!
Description of the bug
@reichan1998 and I came across this while porting bwa-mem for Illumina into the map-reduce setup from treeval. I don't understand why we have
bwa-mem
for HiC configured in this way, but we first convert CRAM to FASTQ, with interleave=False so we are splitting read 1 and read 2 sets. These are then both passed tobwa-mem
, but in the case of HiC we are using the-p
flag:According to the docs, this means we ignore the _R2 reads and treat the _R1 file as interleaved:
Read 2 is non-empty for HiC runs after the CRAM->FASTQ conversion, so would be discarded when running with
-p
(?).In Treeval,
SAMTOOLS_FASTQ
is run creating interleaved output, then passed explicitly as interleaved (-p
) toBWA-MEM2
:This results in an output CRAM with half the number of reads. It has been there since at least v1.0.0, and I haven't worked much with HiC data, so I wanted to check if there was a reason why we do it this way, or if I am misinterpreting the flags?
@muffato
Command used and terminal output
No response
Relevant files
No response
System information
No response