sanger-tol / readmapping

Nextflow DSL2 pipeline to align short and long reads to genome assembly. This workflow is part of the Tree of Life production suite.
https://pipelines.tol.sanger.ac.uk/readmapping
MIT License
11 stars 6 forks source link

bwa-mem configuration for HiC #118

Closed tkchafin closed 2 months ago

tkchafin commented 2 months ago

Description of the bug

@reichan1998 and I came across this while porting bwa-mem for Illumina into the map-reduce setup from treeval. I don't understand why we have bwa-mem for HiC configured in this way, but we first convert CRAM to FASTQ, with interleave=False so we are splitting read 1 and read 2 sets. These are then both passed to bwa-mem, but in the case of HiC we are using the -p flag:

    withName: '.*:.*:ALIGN_HIC:BWAMEM2_MEM' {
        ext.args = { "-5SPCp -R ${meta.read_group}" }
    }

    withName: '.*:.*:ALIGN_ILLUMINA:BWAMEM2_MEM' {
        ext.args = { "-R ${meta.read_group}" }
    }

According to the docs, this means we ignore the _R2 reads and treat the _R1 file as interleaved:

   -p            smart pairing (ignoring in2.fq)

Read 2 is non-empty for HiC runs after the CRAM->FASTQ conversion, so would be discarded when running with -p (?).

In Treeval, SAMTOOLS_FASTQ is run creating interleaved output, then passed explicitly as interleaved (-p) to BWA-MEM2:

    withName: ".*:.*:HIC_BWAMEM2:CRAM_FILTER_ALIGN_BWAMEM2_FIXMATE_SORT" {
        ext.args        = ""
        ext.args1       = "-F0xB00 -nt"      # samtools fastq args
        ext.args2       = { "-5SPCp -H'${rglines}'" }      # bwa-mem2 args
        ext.args3       = "-mpu"  
        ext.args4       = { "--write-index -l1" }
    }

This results in an output CRAM with half the number of reads. It has been there since at least v1.0.0, and I haven't worked much with HiC data, so I wanted to check if there was a reason why we do it this way, or if I am misinterpreting the flags?

@muffato

Command used and terminal output

No response

Relevant files

No response

System information

No response

muffato commented 2 months ago

@tkchafin . Talk to Yumi and Ksenia. The question has come up multiple times in the pipelines meeting between them and Priyanka. Every time, the conclusion was "it's actually all fine", but please let's confirm again and record the reason as a comment somewhere in the code :)

tkchafin commented 2 months ago

Ok, will put together some comparisons of read numbers and request a code review from treeval folks if it doesn't come out making sense 

yumisims commented 2 months ago

The Treeval bwa-mem2 configuration is specifically designed for the cram_filter module. We strictly keep only primary reads and pass them to bwa-mem2. The -5SPCp flags specify that the expected input reads are in interleaved paired end format and ensure that no reads are ignored (e.g., read1, read2, read1, read2). These settings are crucial for ensuring that each I/O operation functions correctly. We can discuss this further during the pipeline meeting if needed. @reichan1998 I don't understand why you set interleave=False for hic mapping as this will produce two fastq files, and when you using bwa-mem2 with -p, you will miss half of the read for sure, could you explain what you are trying to do please, and could you provide more information.

yumisims commented 2 months ago

I have checked the pipelines, both ALIGN_ILLUMINA and ALIGN_HIC would work if your input is cram file. And If you use fastq file as input then ALIGN_ILLUMINA expects two fastq files (read1 and read2) while ALIGN_HIC expects a single interleaved fastq input.

@muffato Do you think we should make this a bit more clear and more consistent for the two subworkflows.

tkchafin commented 2 months ago

Yes Treeval does the IO with -p correctly. In readmapping, we convert the CRAM to FASTQ first using the separate module:

    // Convert from CRAM to FASTQ only if CRAM files were provided as input
    SAMTOOLS_FASTQ ( ch_reads.cram, false )
    ch_versions = ch_versions.mix ( SAMTOOLS_FASTQ.out.versions.first() )

    SAMTOOLS_FASTQ.out.fastq
    | mix ( ch_reads.fastq )
    | set { ch_reads_fastq }

    // Align Fastq to Genome and output sorted BAM
    BWAMEM2_MEM ( ch_reads_fastq, index, fasta, true )
    ch_versions = ch_versions.mix ( BWAMEM2_MEM.out.versions.first() )

which isn't creating interleaved output: https://github.com/sanger-tol/readmapping/blob/main/modules/nf-core/samtools/fastq/main.nf

However, in that case we are still calling bwa-mem with -p (only for HiC):

    withName: '.*:.*:ALIGN_HIC:BWAMEM2_MEM' {
        ext.args = { "-5SPCp -R ${meta.read_group}" }
    }

So this makes a call like this in the case of HiC CRAM input:

bwa-mem2 \
    mem \
    -5SPCp -R '@RG\tID:35528_2%231\tPL:ILLUMINA\tSM:mMelMel3' \
    -t 2 \
    $INDEX \
    mMelMel3_T1_1.fastq.gz mMelMel3_T1_2.fastq.gz\
    | samtools sort  -@ 2  -o mMelMel3_T1.bam -

This wa discovered by Chau. Note we are passing read1 and read2 but also using -p. For Illumina CRAM inputs we handle it correctly because we don't set -p, but with HiC CRAM inputs I think we must be dumping the read2 files (only in readmapping, not in treeval)

yumisims commented 2 months ago

two options: (1) remove the -p from ALIGN_HIC. (2) add -p to ALIGN_ILLUMINA and set interleave as true.

Making -p as flexible argument can be also good in this case.

tkchafin commented 2 months ago

Yes I just wanted to have a second pair of eyes confirm that this was wrong and I wasn't overlooking something. It looks it would have been introduced in 1.1.0:

interleave=False: https://github.com/sanger-tol/readmapping/blob/1.1.0/subworkflows/local/align_short.nf bwa-mem with -p: https://github.com/sanger-tol/readmapping/blob/1.1.0/conf/modules.config

The issue would only apply to HiC CRAM inputs

tkchafin commented 2 months ago

Did a quick verification, with 1.2.2 the CRAM hic input mapped is:

(nf-core) tc25@mib119777s hic % samtools view -c GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
10125

With the latest PR we get:

(nf-core) tc25@mib119777s hic % samtools view -c GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
20331

Both with bwa-mem2. So I will mark this as resolved. Thanks @reichan1998 for finding this issue!