populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP

Isolate Annotation from Seqr loading pipeline #9

Closed MattWellie closed 2 years ago

MattWellie commented 2 years ago

Currently the annotation step (within Dataproc) is not the sole source of annotations.

As the process begins from an MT instead of a VCF, some of the annotations on that original MT are carried over instead of being updated. These are:

The source for these annotations:

- https://github.com/populationgenomics/hail-elasticsearch-pipelines/tree/main/download_and_create_reference_datasets/v02
- https://github.com/populationgenomics/hail-elasticsearch-pipelines/blob/main/download_and_create_reference_datasets/v02/hail_scripts/write_combined_reference_data_ht.py#L39-L48

The annotations are available as a single Hail Table, which is joined onto the MatrixTable to apply them.
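For reference, a minimal sketch of what that join looks like in Hail (the paths and the `ref_data` field name are placeholders, not the pipeline's actual code):

```python
import hail as hl

# Placeholder paths for the cohort MatrixTable and the combined reference-data
# Hail Table produced by write_combined_reference_data_ht.py
mt = hl.read_matrix_table('gs://<bucket>/cohort.mt')
ref_ht = hl.read_table('gs://<bucket>/combined_reference_data.ht')

# Both are keyed by (locus, alleles), so indexing the Table with the
# MatrixTable's row key performs a left join; rows with no match become missing
mt = mt.annotate_rows(ref_data=ref_ht[mt.row_key])
```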

For now - make sure the Annotation stage can supply these annotations, with as little dependence on external libraries as possible

Future - obtain all annotations from a single source, so that this bespoke preparation isn't required?

MattWellie commented 2 years ago

Plan with @vladsaveliev:

This means:

Next:

MattWellie commented 2 years ago

Note - the current logic mixes 3 things:

  1. annotating a VCF in chunks with VEP
  2. applying those VEP annotations to a MT
  3. applying VQSR annotations to a MT

Instead, this will focus on parts 1 & 2, with any references to VQSR removed.
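A rough sketch of parts 1 & 2 only, assuming a Dataproc cluster started with VEP support (e.g. `hailctl dataproc start ... --vep GRCh38`); the paths are placeholders, and here Hail's `hl.vep` handles the chunking internally via `block_size` rather than splitting the VCF by hand:

```python
import hail as hl

mt = hl.import_vcf('gs://<bucket>/cohort.vcf.bgz', reference_genome='GRCh38')

# 1. VEP-annotate a sites-only table; block_size controls how many variants
#    are passed to VEP at a time within each partition
vep_ht = hl.vep(mt.rows().select(), block_size=1000)

# 2. Apply the VEP annotations back onto the MatrixTable via a key join
mt = mt.annotate_rows(vep=vep_ht[mt.row_key].vep)
```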

tiboloic commented 2 years ago


gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP. Here is the location of some of them:

The VEP 85 and 95 Tables are also available through various cloud providers' open datasets programs and can be downloaded without paying egress charges. More detail about how to access those can be found in this blog post and the gnomAD downloads page.

In my experience, performing a join with these tables to get the annotations is extremely fast and convenient. For example, I use it to get the LOFTEE annotations on pLoF variants.
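As a sketch of that pattern (the table path is illustrative rather than an exact gnomAD location; the LOFTEE call lives under `vep.transcript_consequences[*].lof` in the VEP struct):

```python
import hail as hl

# Illustrative path: substitute the actual gnomAD VEP-annotated context table
context_ht = hl.read_table('gs://<public-bucket>/context_vep_annotated.ht')

mt = hl.read_matrix_table('gs://<bucket>/my_dataset.mt')

# Join on (locus, alleles) and pull the precomputed VEP struct across
mt = mt.annotate_rows(vep=context_ht[mt.row_key].vep)

# Keep variants with at least one high-confidence LOFTEE (pLoF) consequence
mt = mt.filter_rows(
    mt.vep.transcript_consequences.any(lambda tc: tc.lof == 'HC')
)
```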

cassimons commented 2 years ago

> gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP

@tiboloic do you know if they have released the details of how this was generated? My Google-fu is coming up blank.

tiboloic commented 2 years ago

> > gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP
>
> @tiboloic do you know if they have released the details of how this was generated? My Google-fu is coming up blank.

@cassimons the best I have found is the description in the supplementary material (page 31) of the gnomAD flagship paper, but it is very succinct. I've pasted it below:

> A dataset of every possible SNV in the human genome (2,858,658,098 sites × 3 substitutions at each site = 8,575,974,294 variants) along with 3 bases of genomic context was created using the GRCh37 reference. This dataset was annotated with methylation data for all CpG variants and coverage summaries as described above, and was subsequently used to annotate the exome and genome datasets where required downstream. The number of variants observed at each downsampling, broken down by variant class, is shown in Extended Data Fig. 4a. As previously shown, the CpG sites begin to saturate at a sample size of about 10,000 individuals, which affects the callset-wide transition/transversion (TiTv) ratio (Extended Data Fig. 4b). In order to compute the proportion of possible variants observed, we filtered the dataset of all possible SNVs to the exome calling intervals described previously [4] and considered only bases where exome coverage was >= 30X (Extended Data Fig. 4c).
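For what it's worth, a hedged Hail sketch of the same idea over a toy region (this is not the gnomAD code; the FASTA paths and region are purely illustrative):

```python
import hail as hl

# Attach the GRCh37 sequence so reference bases can be looked up (paths illustrative)
rg = hl.get_reference('GRCh37')
rg.add_sequence('gs://<bucket>/human_g1k_v37.fasta.gz',
                'gs://<bucket>/human_g1k_v37.fasta.fai')

contig, start, n_positions = '20', 1_000_000, 10_000   # toy region only

ht = hl.utils.range_table(n_positions, n_partitions=8)
ht = ht.annotate(locus=hl.locus(contig, start + ht.idx, reference_genome='GRCh37'))

# Reference base at each position (before=1, after=1 would give the 3bp context)
bases = hl.literal(['A', 'C', 'G', 'T'])
ht = ht.annotate(ref=ht.locus.sequence_context())
ht = ht.filter(bases.contains(ht.ref))  # drop Ns / soft-masked bases

# Three substitutions per site -> one row per possible SNV
ht = ht.annotate(alleles=bases.filter(lambda b: b != ht.ref)
                              .map(lambda alt: [ht.ref, alt]))
ht = ht.explode('alleles').key_by('locus', 'alleles')
```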