Closed maxulysse closed 2 years ago
Just commenting in case it's useful, but the "functional equivalence" workflow described in this publication does recommend using the -Y option: https://www.nature.com/articles/s41467-018-06159-4
@wpoehlm your timing couldn't be better 🙏 I am rewritting the usage.md at the moment and adding this section right now.
on this note @maxulysse: what we are currently doing is the following:
tumor: "-K 100000000 -M -B 3 -R ${meta.read_group}"
normal: "-K 100000000 -M -R ${meta.read_group}"
here the recommendations of the paper:
Standardized parameters:
Do not use -M since it causes split-read alignments to be marked as "secondary" rather than "supplementary" alignments, violating the BAM specification
Use -K 100000000 to achieve deterministic alignment results (Note: this is a hidden option)
Use -Y to force soft-clipping rather than default hard-clipping of supplementary alignments
Include a .alt file for consumption by BWA-MEM; do not perform post-processing of alternate alignments
Optional parameters (may be useful for convenience and not expected to alter results):
-p (for interleaved fastq)
-C (append FASTA/FASTQ comment to SAM output)
-v (logging verbosity)
-t (threading)
-R (read group header line)
According to this, we should drop -M, also somewhere along the lines -Y got dropped 😱 and also was missing in the last dsl1 release:
// -K is an hidden option, used to fix the number of reads processed by bwa mem
// Chunk size can affect bwa results, if not specified,
// the number of threads can change which can give not deterministic result.
// cf https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md
// and https://github.com/gatk-workflows/gatk4-data-processing/blob/8ffa26ff4580df4ac3a5aa9e272a4ff6bab44ba2/processing-for-variant-discovery-gatk4.b37.wgs.inputs.json#L29
CN = params.sequencing_center ? "CN:${params.sequencing_center}\\t" : ""
readGroup = "@RG\\tID:${idRun}\\t${CN}PU:${idRun}\\tSM:${idSample}\\tLB:${idSample}\\tPL:illumina"
// adjust mismatch penalty for tumor samples
status = statusMap[idPatient, idSample]
extra = status == 1 ? "-B 3" : ""
convertToFastq = hasExtension(inputFile1, "bam") ? "gatk --java-options -Xmx${task.memory.toGiga()}g SamToFastq --INPUT=${inputFile1} --FASTQ=/dev/stdout --INTERLEAVE=true --NON_PF=true | \\" : ""
input = hasExtension(inputFile1, "bam") ? "-p /dev/stdin - 2> >(tee ${inputFile1}.bwa.stderr.log >&2)" : "${inputFile1} ${inputFile2}"
aligner = params.aligner == "bwa-mem2" ? "bwa-mem2" : "bwa"
"""
${convertToFastq}
${aligner} mem -K 100000000 -R \"${readGroup}\" ${extra} -t ${task.cpus} -M ${fasta} \
${input} | \
samtools sort --threads ${task.cpus} -m 2G - > ${idSample}_${idRun}.bam
Not sure about the -B 3. However the paper also doesn't differentiate between tumor and normal samples.
from this I gather we should change the mapping params to:
tumor: "-K 100000000 -Y -B 3 -R ${meta.read_group}"
normal: "-K 100000000 -Y -R ${meta.read_group}"
which would also be in line from what I see from GATK. Not sure about the -B 3 (seems intuitive to allow for more mismatches for tumor samples, but why 3?)
never mind forgot we had an issue open for this already: https://github.com/nf-core/sarek/issues/62.
Sorry for spaming you all by talking to myself, but basically from this I would say we add -Y for both and then determine why -B.
And maybe while we at it we can discuss removal of the -M parameter
Replacing -M
seems mostly relevant in case we add MergeBamAlignment
(so more of a future problem):
-M to flag shorter split hits as secondary.
This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.
The flag helps for picard compatibility, but from this it sounds like we don't need to worry about Markduplicates. (see http://bio-bwa.sourceforge.net/bwa.shtml and https://gatk.broadinstitute.org/hc/en-us/articles/360039568932--How-to-Map-and-clean-up-short-read-sequence-data-efficiently)
closed by #644
Issue by @MaxUlysse, moved from SciLifeLab#770