populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Exome calling lists include chrM #735

Open MattWellie opened 1 month ago

MattWellie commented 1 month ago

Recently Paul (VCGS) flagged that mitochondrial variants are coming through into Exome AIP reports.

The final line of gs://cpg-common-main/references/hg38/v0/exome_calling_regions.v1.interval_list is chrM 427 16173 ...

The whole-genome equivalent gs://cpg-common-main/references/hg38/v0/wgs_calling_regions.v1.interval_list terminates at chrY

We have a separate mito calling workflow, so is this accidental calling?

EddieLF commented 1 month ago

This is sourced from the public Broad references

$ gsutil cat gs://gcp-public-data--broad-references/hg38/v0/exome_calling_regions.v1.interval_list
...
chrM    427 16173 ...

The chrM interval only spans about 16 kb, so we could easily subtract it from our exome interval list. I'm curious, though: what is the downside of calling variants in this region? Is it just that it's better to leave them to our mito-specific pipeline?
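As a rough illustration of the subtraction mentioned above, here is a minimal Python sketch that drops every record on one contig from the text of a Picard-style interval list. The function name `drop_contig` is hypothetical and not part of the pipeline, and the sketch assumes the usual layout: header lines starting with `@`, followed by tab-separated records whose first column is the contig.

```python
def drop_contig(interval_list_text: str, contig: str = "chrM") -> str:
    """Return the interval list text without records on `contig`.

    Header lines (starting with '@') are kept unchanged, including the
    @SQ entry for the contig, so the sequence dictionary still matches
    the reference.
    """
    kept = []
    for line in interval_list_text.splitlines():
        if line.startswith("@"):
            # Header line: always keep.
            kept.append(line)
            continue
        # Interval record: the first tab-separated field is the contig name.
        if line.split("\t")[0] != contig:
            kept.append(line)
    return "\n".join(kept) + "\n"
```

In practice the text would be read from a local copy of the interval list and the filtered result written to a new file, leaving the original untouched. An interval-aware tool may be a better fit if anything more than whole-contig removal is needed.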

MattWellie commented 1 month ago

The downside here is that mito calling is fundamentally different. Instead of 2 chromosomes, cells contain many copies of the mitochondrial genome, so instead of WT/Het/Hom, a mito call is a continuous value: the % of mito genomes carrying the variant. In that sense it's more like somatic/cancer analysis, where variants can be picked up at really low levels whilst still being true, so it takes a different approach to get clean results.

HaplotypeCaller and JointGenotyping aren't optimised for that. That's not to say the results are bad or wrong, just that we've probably not looked into the quality of calls in this region.
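To make the contrast concrete, here is a toy Python sketch. It is not how HaplotypeCaller or any mito caller actually works (real callers use genotype likelihoods and error models). Both function names are hypothetical, and the read counts in the usage note are invented for illustration.

```python
def allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads supporting the variant (a crude heteroplasmy estimate)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def nearest_diploid_genotype(fraction: float) -> str:
    """Snap an allele fraction to the closest diploid expectation.

    A deliberately naive stand-in for a diploid model, which expects
    fractions near 0 (WT), 0.5 (Het) or 1.0 (Hom).
    """
    expected = {"WT": 0.0, "Het": 0.5, "Hom": 1.0}
    return min(expected, key=lambda genotype: abs(expected[genotype] - fraction))
```

For example, 30 variant reads out of 1000 give `allele_fraction(30, 1000)` of 0.03, and `nearest_diploid_genotype(0.03)` returns `"WT"`. Under this simplified view, a real variant present in 3% of mito genomes would be treated as reference, which is why a caller that reports the fraction directly suits mitochondrial data better.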

cassimons commented 1 month ago

Our mito calling pipeline is currently untested on exomes, but it should work, perhaps with some minor tweaks. Exome captures vary in how many mito reads they return, but modern ones have decent coverage, so we should move to support this. Once that is done, we should remove chrM from our target intervals.