populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP

Isolate Annotation from Seqr loading pipeline #9

Closed MattWellie closed 2 years ago

MattWellie commented 2 years ago

Currently the annotation step (within Dataproc) is not the sole source of annotations.

As the process begins from an MT instead of a VCF, some of the annotations on that original MT are carried over instead of being updated. These are:

The source for these annotations:

- https://github.com/populationgenomics/hail-elasticsearch-pipelines/tree/main/download_and_create_reference_datasets/v02
- https://github.com/populationgenomics/hail-elasticsearch-pipelines/blob/main/download_and_create_reference_datasets/v02/hail_scripts/write_combined_reference_data_ht.py#L39-L48

The annotations are available as a single Hail Table, which is joined onto the MatrixTable to apply them.
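For reference, a minimal sketch of what that join looks like in Hail (the paths and the `ref_data` field name are placeholders, not the pipeline's actual code):

```python
import hail as hl

# Placeholder paths for the cohort MatrixTable and the combined reference-data
# Hail Table produced by write_combined_reference_data_ht.py
mt = hl.read_matrix_table('gs://<bucket>/cohort.mt')
ref_ht = hl.read_table('gs://<bucket>/combined_reference_data.ht')

# Both are keyed by (locus, alleles), so indexing the Table with the
# MatrixTable's row key performs a left join; rows with no match become missing
mt = mt.annotate_rows(ref_data=ref_ht[mt.row_key])
```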

For now - make sure the Annotation stage can supply these annotations, with as little dependence on external libraries as possible

Future - obtain all annotations from a single source, so that this bespoke preparation isn't required?

MattWellie commented 2 years ago

Plan with @vladsaveliev:

This means:

Next:

MattWellie commented 2 years ago

Note - the current logic mixes 3 things:

  1. annotating a VCF in chunks with VEP
  2. applying those VEP annotations to a MT
  3. applying VQSR annotations to a MT

Instead, this will focus on parts 1 & 2, with any references to VQSR removed.
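A rough sketch of parts 1 & 2 only, assuming a Dataproc cluster started with VEP support (e.g. `hailctl dataproc start ... --vep GRCh38`); the paths are placeholders, and here Hail's `hl.vep` handles the chunking internally via `block_size` rather than splitting the VCF by hand:

```python
import hail as hl

mt = hl.import_vcf('gs://<bucket>/cohort.vcf.bgz', reference_genome='GRCh38')

# 1. VEP-annotate a sites-only table; block_size controls how many variants
#    are passed to VEP at a time within each partition
vep_ht = hl.vep(mt.rows().select(), block_size=1000)

# 2. Apply the VEP annotations back onto the MatrixTable via a key join
mt = mt.annotate_rows(vep=vep_ht[mt.row_key].vep)
```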

tiboloic commented 2 years ago


gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP. Here is the location of some of them:

The VEP 85 and 95 Tables are also available through various cloud providers' open datasets programs and can be downloaded without paying egress charges. More detail about how to access those can be found in this blog post and the gnomAD downloads page.

In my experience, performing a join with these tables to get the annotations is extremely fast and convenient. For example, I use it to get the LOFTEE annotations on pLoF variants.
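As a sketch of that pattern (the table path is illustrative rather than an exact gnomAD location; the LOFTEE call lives under `vep.transcript_consequences[*].lof` in the VEP struct):

```python
import hail as hl

# Illustrative path: substitute the actual gnomAD VEP-annotated context table
context_ht = hl.read_table('gs://<public-bucket>/context_vep_annotated.ht')

mt = hl.read_matrix_table('gs://<bucket>/my_dataset.mt')

# Join on (locus, alleles) and pull the precomputed VEP struct across
mt = mt.annotate_rows(vep=context_ht[mt.row_key].vep)

# Keep variants with at least one high-confidence LOFTEE (pLoF) consequence
mt = mt.filter_rows(
    mt.vep.transcript_consequences.any(lambda tc: tc.lof == 'HC')
)
```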

cassimons commented 2 years ago

> gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP

@tiboloic do you know if they have released the details of how this was generated? My Google-fu is coming up blank.

tiboloic commented 2 years ago

> > gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP
>
> @tiboloic do you know if they have released the details of how this was generated? My Google-fu is coming up blank.

@cassimons the best I have found is the description in the supplementary material (page 31) of the gnomAD flagship paper, but it is very succinct. I've pasted it below:

> A dataset of every possible SNV in the human genome (2,858,658,098 sites × 3 substitutions at each site = 8,575,974,294 variants) along with 3 bases of genomic context was created using the GRCh37 reference. This dataset was annotated with methylation data for all CpG variants and coverage summaries as described above, and was subsequently used to annotate the exome and genome datasets where required downstream. The number of variants observed at each downsampling, broken down by variant class, is shown in Extended Data Fig. 4a. As previously shown, the CpG sites begin to saturate at a sample size of about 10,000 individuals, which affects the callset-wide transition/transversion (TiTv) ratio (Extended Data Fig. 4b). In order to compute the proportion of possible variants observed, we filtered the dataset of all possible SNVs to the exome calling intervals described previously [4] and considered only bases where exome coverage was >= 30X (Extended Data Fig. 4c).
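For what it's worth, a hedged Hail sketch of the same idea over a toy region (this is not the gnomAD code; the FASTA paths and region are purely illustrative):

```python
import hail as hl

# Attach the GRCh37 sequence so reference bases can be looked up (paths illustrative)
rg = hl.get_reference('GRCh37')
rg.add_sequence('gs://<bucket>/human_g1k_v37.fasta.gz',
                'gs://<bucket>/human_g1k_v37.fasta.fai')

contig, start, n_positions = '20', 1_000_000, 10_000   # toy region only

ht = hl.utils.range_table(n_positions, n_partitions=8)
ht = ht.annotate(locus=hl.locus(contig, start + ht.idx, reference_genome='GRCh37'))

# Reference base at each position (before=1, after=1 would give the 3bp context)
bases = hl.literal(['A', 'C', 'G', 'T'])
ht = ht.annotate(ref=ht.locus.sequence_context())
ht = ht.filter(bases.contains(ht.ref))  # drop Ns / soft-masked bases

# Three substitutions per site -> one row per possible SNV
ht = ht.annotate(alleles=bases.filter(lambda b: b != ht.ref)
                              .map(lambda alt: [ht.ref, alt]))
ht = ht.explode('alleles').key_by('locus', 'alleles')
```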