No reads for mitochondrial chromosome and additional contigs after recalibration

edg1983 commented 7 months ago

Description of the bug

We have run Sarek pipeline v3.1.2 on 3 WGS batches, including around 2500 samples, using the GATK.GRCh38 reference genome from iGenomes and mostly default parameters (see the command).

In all of these samples, the alignment CRAM files from the mapping and markduplicate steps look fine, but in the CRAM file after the recalibration step, all reads outside canonical chromosomes (i.e., chr1-22, X, Y) are lost. This means that we have no reads on the mitochondrial genome or any other additional contig, and therefore, we can't call variants or perform other analyses on chrM, for example.

Is this a known behavior related to recalibration, or is something unexpected happening during the recalibration step?

Command used and terminal output

#variables
pipeline_name="nf-core/sarek"
pipeline_revision="3.1.2"
pipeline_hub="github"
pipeline_work_dir="/scratch/nextflow_works/primary_analysis"

#run nextflow
nextflow run \
    $pipeline_name -r $pipeline_revision -hub $pipeline_hub \
    -c nextflow.config \
    -profile genfac \
    -params-file params.yaml \
    -work-dir $pipeline_work_dir \
    -resume;

#Content of params.yml
#genome: GATK.GRCh38
#igenomes_base: /processing_data/reference_datasets/iGenomes/2023.1
#input: metadata.csv
#outdir: output_nf-core@sarek_3.1.2/
#save_mapped: true

Relevant files

No response

System information

nextflow v23.10.0
sarek v3.1.2
system: HPC
executor: SLURM
container engine: singularity v3.8.5

FriederikeHanssen commented 7 months ago

Hey! Yes this is due to the intervals file that is used. We use the one provided by GATK. You can change the intervals file or skip the intervals usage all together (beware this will increase runtime quite a bit because then no parallelization can get done).

edg1983 commented 7 months ago

Thanks for the explanation. So you are using this file iGenomes/2023.1/Homo_sapiens/GATK/GRCh38/Annotation/intervals/wgs_calling_regions.hg38.bed from the iGenomes release, right?

I'm fine with excluding additional contigs (like decoys, HLA, ALT, etc.), but also excluding chrM is not the best solution, especially when this is applied to the alignments. I guess most people expect to have chrM reads in all the BAM/CRAM files produced by the pipeline. Overall, the current behavior seems a little confusing since the data is present in some BAM/CRAM (mapped and markduplicates), but then removed in the recalibrated one.

FriederikeHanssen commented 7 months ago

I need to read up on it a bit more. We basically just took the reference files as is from GATK and ran with it: https://gatk.broadinstitute.org/hc/en-us/articles/360035889551-When-should-I-restrict-my-analysis-to-specific-intervals

So I am guessing they had a good reason to not have this in the intervals file? In any case you can easily circumvent this by adding the mitochondrial coordinates to the file and specifying it with --intervals.

We are actually using this file: https://github.com/nf-core/sarek/blob/6aeac929c924ba382baa42a0fe969b4e0e753ca9/conf/igenomes.config#L64 which is the same except that we removed the seconds because it was easier for grouping them together. Regions are all the same though.

The duplicate marking is not parallelised along genomic coordinates but everything is processed together which is why nothing is removed.

edg1983 commented 7 months ago

Thanks for the advice. I will procede preparing my custom intervals file for future use.

In think GATK compiled this list of regions mainly for variant calling, to remove noisy or difficult to re-assemble regions that would slow down the variant caller and give low confidence calls. They probably excluded chrM from these regions because they have a dedicated pipeline and calling variants on chrM is usually treated with dedicated methods.

I can understand this approach in the context of variant calling, but keep in mind that using these regions has the subtle consequence of not only skipping chrM, but also specific regions across chromosomes. If this is the default behaviour, this needs to be strongly highlighted in the documentation so that users have the right expectations on the output. For example, some downstream analysis expect variants called genome without any exclusion regions and this is relevant also when performing datasets comparison.

Finally, IMHO, it is never a good practice to remove reads from the BAM files since usually these are supposed to losslessly represent the input sequences (after reads QC and trimming). If you decide not to change the current approach with recalibration then my suggestion is to at least strongly highlight this behaviour in the docs.

nf-core / sarek