opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Modelling new variant annotation dataset for variant page #3333

Open DSuveges opened 3 weeks ago

DSuveges commented 3 weeks ago

To provide variant-level metadata for the new variant page, variant annotation needs to be generated for all variants for which we have any phenotypic information. We decided to drop gnomAD as a source of variant annotation because it might not be complete, e.g. it most likely misses rare variants that we ingest from ClinVar. Instead, we pool all variants at the variant index generation step and generate annotation via Ensembl's VEP. This ticket aims to capture all decisions around the generation of the variant annotation dataset.

Tasks

DSuveges commented 3 weeks ago

Running VEP

The Airflow layer will manage the VEP annotation process via Google Batch jobs. It has been prototyped and runs successfully for ~5M variants in around 1 hour. The code to containerise and execute the annotation is in this PR: https://github.com/opentargets/gentropy/pull/608/files

DSuveges commented 3 weeks ago

Transcript selection

Non-canonical transcripts are dropped. All downstream analysis will be based only on canonical transcripts.
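
As a rough illustration of the transcript selection step, the sketch below filters a single VEP JSON record down to its canonical transcripts. It assumes VEP was run with JSON output and the canonical flag enabled, so that canonical transcripts carry `"canonical": 1` under `transcript_consequences`. The function name `keep_canonical` and the example record are hypothetical and not taken from the gentropy code.

```python
from typing import Any


def keep_canonical(vep_record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a VEP JSON record keeping only canonical transcripts.

    Assumption: canonical transcripts are marked with "canonical": 1.
    """
    transcripts = vep_record.get("transcript_consequences", [])
    canonical = [t for t in transcripts if t.get("canonical") == 1]
    # Build a new dict so the input record is left untouched.
    return {**vep_record, "transcript_consequences": canonical}


# Hypothetical record, for illustration only.
record = {
    "input": "1 12345 . A G",
    "transcript_consequences": [
        {"transcript_id": "ENST_A", "gene_id": "ENSG_X", "canonical": 1},
        {"transcript_id": "ENST_B", "gene_id": "ENSG_X"},
    ],
}
filtered = keep_canonical(record)
print([t["transcript_id"] for t in filtered["transcript_consequences"]])
```

In a Spark-based pipeline the same idea would presumably be expressed as a filter over the exploded transcript array rather than a per-record Python function.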

In-silico predictors

Some of the in-silico predictors need to come with the corresponding target identifier.

      ├───inSilicoPredictors: array 
      │   ├───element: struct 
      │   │   ├───method : string
      │   │   ├───assessment : string
      │   │   ├───flag : string
      │   │   ├───score : float
      │   │   ├───targetId : string

Where do we want the target identifier:

  1. AlphaMissense
  2. SIFT, PolyPhen
  3. LOFTEE
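
One way to picture the struct above is as a small record type in which `targetId` is only populated for the methods listed. This is a minimal sketch in plain Python, not the gentropy schema: the lowercase method keys, the helper `make_predictor`, and the placeholder method name `some_variant_level_method` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed lowercase keys for the methods listed above as needing a target id.
TARGET_SPECIFIC_METHODS = {"alphamissense", "sift", "polyphen", "loftee"}


@dataclass(frozen=True)
class InSilicoPredictor:
    """Mirrors the fields of one inSilicoPredictors array element."""

    method: str
    assessment: Optional[str] = None
    flag: Optional[str] = None
    score: Optional[float] = None
    targetId: Optional[str] = None


def make_predictor(
    method: str,
    score: Optional[float] = None,
    assessment: Optional[str] = None,
    flag: Optional[str] = None,
    gene_id: Optional[str] = None,
) -> InSilicoPredictor:
    """Attach the target id only for methods that are target specific."""
    needs_target = method.lower() in TARGET_SPECIFIC_METHODS
    return InSilicoPredictor(
        method=method,
        assessment=assessment,
        flag=flag,
        score=score,
        targetId=gene_id if needs_target else None,
    )


# Hypothetical values, for illustration only.
sift = make_predictor("SIFT", score=0.02, assessment="deleterious", gene_id="ENSG_X")
other = make_predictor("some_variant_level_method", score=25.0, gene_id="ENSG_X")
print(sift.targetId, other.targetId)
```

Keeping `targetId` nullable lets variant-level and target-specific predictions share one array, at the cost of consumers having to check for the missing value.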

Cross-refs

We can extract cross-refs pointing to OMIM, ClinVar and Ensembl.

DSuveges commented 3 weeks ago

Proposed changes in the context of other gentropy processes

  1. The gnomAD pre-process will no longer produce a variant annotation dataset.
  2. The pre-process should yield a new datatype called variant list.
  3. This variant list is expected to have variant id, cross-ref and allele frequency columns. This datatype needs to have its own schema.
  4. This dataset should have a method to output data as VCF.
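
To make the proposed variant list more concrete, here is a minimal sketch of such a datatype with a VCF export. It is written in plain Python rather than Spark, and everything in it is an assumption for illustration: the field names `variantId`, `crossRef` and `alleleFrequency`, the `CHROM_POS_REF_ALT` layout of the variant id, and the function `to_vcf`. Only the minimal VCF columns needed as annotation input are written.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Minimal VCF header: format line plus the eight fixed column names.
VCF_HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"


@dataclass(frozen=True)
class VariantListRow:
    variantId: str  # assumed layout: CHROM_POS_REF_ALT
    crossRef: Optional[str] = None
    alleleFrequency: Optional[float] = None


def to_vcf(rows: Iterable[VariantListRow]) -> str:
    """Render rows as VCF text, sorted by chromosome (as text) and position."""
    parsed = []
    for row in rows:
        chrom, pos, ref, alt = row.variantId.split("_")
        parsed.append((chrom, int(pos), ref, alt, row.variantId))
    lines = [VCF_HEADER]
    for chrom, pos, ref, alt, variant_id in sorted(parsed, key=lambda r: (r[0], r[1])):
        # QUAL, FILTER and INFO are left as missing values (".").
        lines.append("\t".join([chrom, str(pos), variant_id, ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


# Hypothetical rows, for illustration only.
rows = [
    VariantListRow("1_200_A_G"),
    VariantListRow("1_100_C_T", alleleFrequency=0.01),
]
vcf_text = to_vcf(rows)
print(vcf_text)
```

Cross-refs and allele frequencies are deliberately not written to the VCF here; they would stay in the variant list and be joined back to the annotation by variant id.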