google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.17k stars 713 forks source link

Expand default exclude contigs for GRCh38 #37

Closed jjfarrell closed 5 years ago

jjfarrell commented 6 years ago

Looking through the code, it appears the default contigs for exclusion are based on hs37d5. Could this exclude list be expanded to include the appropriate GRCh38 contigs?

Also why is the mitochondria is on the default contig exclusion list? Is there a technical/algorithm issue trying to run deepvariant on the MT contig?

exclude_contigs=[
          # The two canonical names for the contig representing the human
          # mitochondrial sequence.
          'chrM',
          'MT',
          # From hs37d5.
          # (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/README_human_reference_20110707)  # pylint:disable=line-too-long
          'GL000207.1',
          'GL000226.1',
          'GL000229.1',
          'GL000231.1',
          'GL000210.1',
depristo commented 6 years ago

We'd be happy to exclude commonly known non-analyzable contigs on b38. Do you have a proposed list of such contig names, along with why they should be excluded?

The MT: we don't expect, like most variant callers, to properly handle the structure of genetic variation within the mitochrondira. You can read a bit about the challenges of MT calling here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201154/

jjfarrell commented 6 years ago

Thanks! Attached is a list from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

This list has all the GRCh38 contigs after dropping chr1-chr22, chrX, chrY and chrM.

GRCh38 has lots of additional contigs which are described here: https://software.broadinstitute.org/gatk/documentation/article.php?id=7857 https://software.broadinstitute.org/gatk/blog?id=8180

GRCh38.exclude.contigs.txt

depristo commented 6 years ago

Thanks @jjfarrell. We've added a bug and will fix for the next release.

rpoplin commented 6 years ago

Nice! This is fixed now and will go out with the next release of DeepVariant. Thanks for the great suggestion.

PedalheadPHX commented 5 years ago

@rpoplin @depristo Is there a specific reason to exclude all of the super contigs in GRCh38 from deepvariant analysis? I agree with removing the HLA, EBV, M/MT and decoys, but removing the placed and unplaced contigs doesn't seem justified. Other callers do work on these regions so it doesn't seem like they are truly "Common problematic contigs on GRCh38" as stated in the release notes for v0.5.0.

Does it affect training or calling if they were included? I'd suggest the user should be able to provide the list of regions to exclude or include as its taken a bit of work to sort out why deepvariant is not making calls on all contigs in the provided BAM files like a number of other callers we are testing

pichuan commented 5 years ago

Thanks @PedalheadPHX Reopening this bug so it's easier to track that we have an unanswered question. Also assigning to @AndrewCarroll

AndrewCarroll commented 5 years ago

Hi @jjfarrell

I hadn't encountered users who meaningfully used the unplaced, but non-decoy contigs. The exclude list is current coded into DeepVariant and not modifiable. However, I agree that it would be reasonable to assess adding those in.

The main reason a decoy or unplaced contig should be excluded is if it aggregates a large amount of incorrectly mapped coverage. It makes sense for us to investigate how much the placed and unplaced contigs are affected by that and decide based on those results.

Training will not be affected, because these contigs do not have truth label variants that I am aware of. The model will be the same.

We'll consider this a potential change for the next release, which will hopefully be in the next month or two. I hope that timing will be adequate.

ESDeutekom commented 1 year ago

Dear @AndrewCarroll ,

I know this issue was closed, but I did not want to open a new one for a question this trivial. Do I understand correctly the team has now implemented some form of alt contig consideration, as the documentation now says: "Second, the BAM file provided to --reads should be aligned to a "compatible" version of the genome reference provided as the --ref. By compatible here we mean the BAM and FASTA share at least a reasonable set of common contigs, as DeepVariant will only process contigs shared by both the BAM and reference. "" [https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md].

Thanks in advance

pgrosu commented 1 year ago

Dear @ESDeutekom,

There are a few things to unpack here. The goal is to make sure that the BAM file and reference FASTA file match in their contigs and bases to ensure correct variants get called between the two, and if local realignment is required that the regions match appropriately. Thus, a couple of steps get performed:

$1)$ So in order to be sure there is proper consensus between the two, a first pass over the reference gets processed to remove contigs based on an internal list of excluded contigs found in the following file:

https://github.com/google/deepvariant/blob/r1.5/deepvariant/exclude_contigs.py

This is done in the reference first, because most likely these would not exist in most BAM files.

$2)$ Now given the remaining contigs from the reference, these are used to find common contigs that exist in the BAM file as well, to ensure there is proper overlap. All of this (including $Step \; 1$ above) is performed by the _ensure_consistent_contigs function.

The two steps summarize the statement you noticed in the documentation.

Now the only excluding option I see is for exclude_regions, though the rest of the above excluded contigs are hard-coded as exclude_contigs=exclude_contigs.EXCLUDED_HUMAN_CONTIGS. To exclude a region, that would come as a space-separated list of regions, which can be region literals (e.g., chr20:10-20) or paths to BED/BEDPE files.

Hope it helps, Paul

pichuan commented 1 year ago

Dear @AndrewCarroll ,

I know this issue was closed, but I did not want to open a new one for a question this trivial. Do I understand correctly the team has now implemented some form of alt contig consideration, as the documentation now says: "Second, the BAM file provided to --reads should be aligned to a "compatible" version of the genome reference provided as the --ref. By compatible here we mean the BAM and FASTA share at least a reasonable set of common contigs, as DeepVariant will only process contigs shared by both the BAM and reference. "" [https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md].

Thanks in advance

Hi @ESDeutekom ,

Note that the documentation you quoted actually has been there for a while now, since 2018: https://github.com/google/deepvariant/blob/r0.6/docs/deepvariant-details.md If this wording is confusing, we're happy to improve it. Please let us know which part is particularly confusing.

If your question is whether DeepVariant takes alt contigs into account, the answer also hasn't changed - not really. In general, when choosing between mapping to GRCh38 with alt contigs or without alt contigs, we empirically find that it's better to use the latter.

Hope this helps.

ESDeutekom commented 1 year ago

Dear @pgrosu and @pichuan,

Thank you for your answers. This does help me further along. With regards to the following:

In general, when choosing between mapping to GRCh38 with alt contigs or without alt contigs, we empirically find that it's better to use the latter.

Is this stated somwhere, or published with comparative results? I am curious to get some insight and elaboration on why this would happen.

Kind regards,
Eva

kishwarshafin commented 1 year ago

@ESDeutekom , this article https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use from Heng describes different caveats of using different reference genomes. As suggested in that article and from our previous experience we found GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz to be the most suitable.

pgrosu commented 1 year ago

Hi Eva,

Usually such comparisons are rare, as Kishwar suggested through Heng's article. If you are curious about such a comparison, below is a table generated on GIAB samples through GATK's HaplotypeCaller on GRCh38 and GRCh38 with masked alt regions:

image

Hope it helps, ~p

ESDeutekom commented 1 year ago

Dear @pgrosu and @kishwarshafin,

Thank you very much for the added information. This certainly helps me along.

Kind regards,