Open DSuveges opened 3 weeks ago
The Airflow layer will manage the VEP annotation process via Google Batch jobs. It has been prototyped and runs successfully for ~5M variants in around 1 hour. The code to containerise and execute the annotation is in this PR: https://github.com/opentargets/gentropy/pull/608/files
gs://genetics_etl_python_playground/vep
(currently supports Ensembl v.111)
Dropping non-canonical transcripts. All downstream analysis will be based only on canonical transcripts.
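The canonical-transcript filter could be sketched as below. This is a minimal illustration and not the code from the PR: it assumes VEP was run with JSON output and canonical flagging enabled, so that each entry in `transcript_consequences` carries `"canonical": 1` when it is the canonical transcript. The function name and the example identifiers are hypothetical.

```python
import json


def keep_canonical_transcripts(vep_record: dict) -> dict:
    """Return a copy of one VEP JSON record keeping only canonical transcripts.

    Assumption: canonical transcripts are marked with "canonical": 1.
    Records without transcript consequences get an empty list.
    """
    transcripts = vep_record.get("transcript_consequences", [])
    canonical = [t for t in transcripts if t.get("canonical") == 1]
    return {**vep_record, "transcript_consequences": canonical}


# Hypothetical, heavily trimmed VEP output line (one JSON object per variant).
line = json.dumps(
    {
        "input": "example_variant",
        "transcript_consequences": [
            {"transcript_id": "ENST_CANONICAL", "canonical": 1},
            {"transcript_id": "ENST_OTHER"},
        ],
    }
)

record = keep_canonical_transcripts(json.loads(line))
kept_ids = [t["transcript_id"] for t in record["transcript_consequences"]]
```

In the actual pipeline the same condition would more likely be applied as a filter on the transcript array in Spark; the plain-Python version only shows the rule itself.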
Some of the in-silico predictors need to come with the corresponding target identifier.
├───inSilicoPredictors: array
│ ├───element: struct
│ │ ├───method : string
│ │ ├───assessment : string
│ │ ├───flag : string
│ │ ├───score : float
│ │ ├───targetId : string
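
As an illustration of how one element of this array might be populated, the sketch below maps a single VEP transcript consequence onto the struct above. It is an assumption-laden example, not the agreed design: it uses only the SIFT and PolyPhen fields of the VEP JSON output (`sift_score`, `sift_prediction`, `polyphen_score`, `polyphen_prediction`), takes `targetId` from the transcript's `gene_id`, and leaves `flag` empty. The helper name and the example gene identifier are hypothetical.

```python
from typing import List, Optional, TypedDict


class InSilicoPredictor(TypedDict):
    """Mirrors one element of the inSilicoPredictors array."""

    method: str
    assessment: Optional[str]
    flag: Optional[str]
    score: Optional[float]
    targetId: Optional[str]


def predictors_from_transcript(transcript: dict) -> List[InSilicoPredictor]:
    """Build predictor structs from one VEP transcript consequence."""
    # Assumption: the target identifier is the Ensembl gene ID of the transcript.
    target_id = transcript.get("gene_id")
    predictors: List[InSilicoPredictor] = []
    for method in ("sift", "polyphen"):
        score = transcript.get(f"{method}_score")
        assessment = transcript.get(f"{method}_prediction")
        if score is None and assessment is None:
            continue  # this predictor is not reported for the transcript
        predictors.append(
            InSilicoPredictor(
                method=method,
                assessment=assessment,
                flag=None,  # left empty in this sketch
                score=score,
                targetId=target_id,
            )
        )
    return predictors


# Hypothetical transcript consequence with only a SIFT annotation.
predictors = predictors_from_transcript(
    {"gene_id": "ENSG_EXAMPLE", "sift_score": 0.01, "sift_prediction": "deleterious"}
)
```

Predictors that are not transcript-specific would presumably leave `targetId` empty, which is why the field is optional here.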
Where do we want the target identifier:
We can extract cross-references pointing to OMIM, ClinVar and Ensembl.
To provide variant-level metadata for the new variant page, variant annotation needs to be generated for all variants for which we have any phenotypic information. We decided to drop GnomAD as a source of variant annotation because it might not be complete, e.g. it most likely misses rare variants that we ingest from ClinVar. Instead, we pool all variants at the variant index generation step and generate the annotation via Ensembl's VEP. This ticket aims to capture all decisions around the generation of the variant annotation dataset.
Tasks
- VariantAnnotation schema
- VariantAnnotation schema dataset class. This class should have the following methods:
  - .from_vep_json() - to create variant annotation based on VEP output.
- VariantList schema.