nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
400 stars 404 forks source link

Should we use -Y option for bwa mem? #101

Closed maxulysse closed 2 years ago

maxulysse commented 4 years ago

Issue by @MaxUlysse, moved from SciLifeLab#770

Use soft clipping CIGAR operation for supplementary alignments. By default, BWA-MEM uses soft clipping for the primary alignment and hard clipping for supplementary alignments.

We're not using this -Y option for bwa mem, but I looked more into the GTAK best practice workflow, and they're using it.

So I'm wondering, should we use that or not?

wpoehlm commented 2 years ago

Just commenting in case it's useful, but the "functional equivalence" workflow described in this publication does recommend using the -Y option: https://www.nature.com/articles/s41467-018-06159-4

FriederikeHanssen commented 2 years ago

@wpoehlm your timing couldn't be better 🙏 I am rewritting the usage.md at the moment and adding this section right now.

on this note @maxulysse: what we are currently doing is the following:

        tumor: "-K 100000000 -M -B 3 -R ${meta.read_group}" 
        normal: "-K 100000000 -M -R ${meta.read_group}" 

here the recommendations of the paper:

Standardized parameters:

    Do not use -M since it causes split-read alignments to be marked as "secondary" rather than "supplementary" alignments, violating the BAM specification
    Use -K 100000000 to achieve deterministic alignment results (Note: this is a hidden option)
    Use -Y to force soft-clipping rather than default hard-clipping of supplementary alignments
    Include a .alt file for consumption by BWA-MEM; do not perform post-processing of alternate alignments

Optional parameters (may be useful for convenience and not expected to alter results):

    -p (for interleaved fastq)
    -C (append FASTA/FASTQ comment to SAM output)
    -v (logging verbosity)
    -t (threading)
    -R (read group header line)

According to this, we should drop -M, also somewhere along the lines -Y got dropped 😱 and also was missing in the last dsl1 release:

    // -K is an hidden option, used to fix the number of reads processed by bwa mem
    // Chunk size can affect bwa results, if not specified,
    // the number of threads can change which can give not deterministic result.
    // cf https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md
    // and https://github.com/gatk-workflows/gatk4-data-processing/blob/8ffa26ff4580df4ac3a5aa9e272a4ff6bab44ba2/processing-for-variant-discovery-gatk4.b37.wgs.inputs.json#L29
    CN = params.sequencing_center ? "CN:${params.sequencing_center}\\t" : ""
    readGroup = "@RG\\tID:${idRun}\\t${CN}PU:${idRun}\\tSM:${idSample}\\tLB:${idSample}\\tPL:illumina"
    // adjust mismatch penalty for tumor samples
    status = statusMap[idPatient, idSample]
    extra = status == 1 ? "-B 3" : ""
    convertToFastq = hasExtension(inputFile1, "bam") ? "gatk --java-options -Xmx${task.memory.toGiga()}g SamToFastq --INPUT=${inputFile1} --FASTQ=/dev/stdout --INTERLEAVE=true --NON_PF=true | \\" : ""
    input = hasExtension(inputFile1, "bam") ? "-p /dev/stdin - 2> >(tee ${inputFile1}.bwa.stderr.log >&2)" : "${inputFile1} ${inputFile2}"
    aligner = params.aligner == "bwa-mem2" ? "bwa-mem2" : "bwa"
    """
    ${convertToFastq}
    ${aligner} mem -K 100000000 -R \"${readGroup}\" ${extra} -t ${task.cpus} -M ${fasta} \
    ${input} | \
    samtools sort --threads ${task.cpus} -m 2G - > ${idSample}_${idRun}.bam

Not sure about the -B 3. However the paper also doesn't differentiate between tumor and normal samples.

FriederikeHanssen commented 2 years ago

from this I gather we should change the mapping params to:

        tumor: "-K 100000000 -Y -B 3 -R ${meta.read_group}" 
        normal: "-K 100000000 -Y -R ${meta.read_group}" 

which would also be in line from what I see from GATK. Not sure about the -B 3 (seems intuitive to allow for more mismatches for tumor samples, but why 3?)

FriederikeHanssen commented 2 years ago

never mind forgot we had an issue open for this already: https://github.com/nf-core/sarek/issues/62.

Sorry for spaming you all by talking to myself, but basically from this I would say we add -Y for both and then determine why -B.

And maybe while we at it we can discuss removal of the -M parameter

FriederikeHanssen commented 2 years ago

Replacing -M seems mostly relevant in case we add MergeBamAlignment (so more of a future problem):

-M to flag shorter split hits as secondary.
This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

The flag helps for picard compatibility, but from this it sounds like we don't need to worry about Markduplicates. (see http://bio-bwa.sourceforge.net/bwa.shtml and https://gatk.broadinstitute.org/hc/en-us/articles/360039568932--How-to-Map-and-clean-up-short-read-sequence-data-efficiently)

FriederikeHanssen commented 2 years ago

closed by #644