opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal

https://platform.opentargets.org https://genetics.opentargets.org

Extract V2G evidence from functional predictions #2789

Closed ireneisdoomed closed 1 year ago

ireneisdoomed commented 1 year ago

One of the sources for the V2G dataset we have in production is the relationship between a variant and the impact it is predicted to have on the transcript.

This information is predicted by VEP and is available in the variant annotation dataset that we extract from gnomAD. On top of the most severe functional consequence, there is more functional annotation that we think is valuable to display. Therefore, the new dataset will include variant/gene information from different angles:

What is the most severe effect of the variant per gene?
What is the predicted PolyPhen score of the variant on a gene?
What is the predicted SIFT score of the variant on a gene?
Is the variant predicted to cause a loss of function in a gene?

ireneisdoomed commented 1 year ago

Most severe functional consequence

The logic is:

extract all predicted functional consequences for each gene based on its canonical transcript
get the most severe one based on a hardcoded score provided in gs://genetics-portal-data/lut/vep_consequences.tsv

The score is already normalised between 0 and 1.
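
The selection step above can be sketched in plain Python. This is only an illustration, not the production pipeline: the function name and the lookup table are hypothetical, and only the `splice_donor_variant` row mirrors the example evidence in this thread. The other rows are placeholders and do not reflect the contents of `vep_consequences.tsv`.

```python
# Hypothetical lookup table: consequence label -> (SO id, normalised score).
# Only splice_donor_variant mirrors the example evidence in this thread;
# the placeholder rows are invented for illustration.
CONSEQUENCE_LUT = {
    "splice_donor_variant": ("SO_0001575", 1.0),
    "placeholder_moderate": ("SO_PLACEHOLDER_1", 0.5),
    "placeholder_low": ("SO_PLACEHOLDER_2", 0.1),
}


def most_severe_consequence(consequence_terms, lut=CONSEQUENCE_LUT):
    """Return (label, so_id, score) for the highest-scoring known term, or None."""
    # Keep only the terms that have an entry in the lookup table.
    scored = [(term, *lut[term]) for term in consequence_terms if term in lut]
    if not scored:
        return None
    # The most severe consequence is the one with the highest score.
    return max(scored, key=lambda row: row[2])


# Terms predicted for one gene's canonical transcript (illustrative input).
result = most_severe_consequence(["placeholder_low", "splice_donor_variant"])
print(result)
```

Terms missing from the lookup table are ignored in this sketch; whether the real pipeline drops or flags them is not stated here.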

Example of a V2G evidence:

 geneId                         | ENSG00000285823
 resourceScore                  | null
 datasourceId                   | variantConsequence
 datatypeId                     | vep
 pmid                           | null
 biofeature                     | null
 score                          | 1
 variantId                      | 1_25043903_G_A
 label                          | splice_donor_variant
 variantFunctionalConsequenceId | SO_0001575
 isHighQualityPlof              | null
 chromosome                     | 1

ireneisdoomed commented 1 year ago

PolyPhen score

The logic consists of simply parsing the VEP object. The score is already normalised between 0 and 1.
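
A minimal sketch of that parsing step, assuming each transcript consequence is available as a dictionary with `gene_id`, `polyphen_score` and `polyphen_prediction` keys (the field names follow the VEP JSON output, but their presence in our variant annotation dataset is an assumption here). The function name is hypothetical.

```python
def polyphen_evidence(variant_id, transcript_consequence):
    """Build a V2G evidence record from one transcript consequence, or None."""
    score = transcript_consequence.get("polyphen_score")
    if score is None:
        # No PolyPhen prediction for this transcript: no evidence is produced.
        return None
    return {
        "variantId": variant_id,
        "geneId": transcript_consequence["gene_id"],
        "datatypeId": "vep",
        "datasourceId": "polyphen",
        # The score is already between 0 and 1, so it is used as-is.
        "score": score,
        "label": transcript_consequence.get("polyphen_prediction"),
    }


# Input values taken from the example evidence in this comment.
evidence = polyphen_evidence(
    "1_65401836_G_A",
    {
        "gene_id": "ENSG00000116675",
        "polyphen_score": 0.005,
        "polyphen_prediction": "benign",
    },
)
print(evidence)
```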

Example of a V2G evidence:

 geneId                         | ENSG00000116675
 resourceScore                  | null
 datasourceId                   | polyphen
 datatypeId                     | vep
 pmid                           | null
 biofeature                     | null
 score                          | 0.005
 variantId                      | 1_65401836_G_A
 label                          | benign
 variantFunctionalConsequenceId | null
 isHighQualityPlof              | null
 chromosome                     | 1

ireneisdoomed commented 1 year ago

SIFT score

The logic consists of simply parsing the VEP object. The score is already normalised between 0 and 1, but it must be interpreted inversely to PolyPhen: the closer the score is to 0, the higher the probability that a substitution is damaging. So under resourceScore we keep the actual SIFT score, and under score the inverted one, which feeds the aggregated V2G score.
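
A minimal sketch of the inversion, assuming "inverted" means `1 - sift_score` (this matches the example evidence, where a SIFT score of 1.0 gives a V2G score of 0.0, but the exact formula is an assumption). The function name and argument names are hypothetical.

```python
def sift_evidence(variant_id, gene_id, sift_score, sift_prediction):
    """Build a V2G evidence record that keeps the raw SIFT score and its inverse."""
    return {
        "variantId": variant_id,
        "geneId": gene_id,
        "datatypeId": "vep",
        "datasourceId": "sift",
        # Raw SIFT score: values close to 0 mean "likely damaging".
        "resourceScore": sift_score,
        # Inverted score (assumed to be 1 - raw): values close to 1 mean "likely damaging".
        "score": 1.0 - sift_score,
        "label": sift_prediction,
    }


# Input values taken from the example evidence in this comment.
evidence = sift_evidence("1_214637901_C_G", "ENSG00000117724", 1.0, "tolerated")
print(evidence["resourceScore"], evidence["score"])
```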

Example of a V2G evidence:

 geneId                         | ENSG00000117724
 resourceScore                  | 1.0
 datasourceId                   | sift
 datatypeId                     | vep
 pmid                           | null
 biofeature                     | null
 score                          | 0.0
 variantId                      | 1_214637901_C_G
 label                          | tolerated
 variantFunctionalConsequenceId | null
 isHighQualityPlof              | null
 chromosome                     | 1

ireneisdoomed commented 1 year ago

pLOF assessment

The logic consists of:

extracting the pLOF flag, which is based on the LOFTEE VEP plugin. For eligible variants, this indicates whether the variant is predicted to cause a LOF with high or low confidence.
scoring the evidence based on that confidence:
- high confidence evidence: score of 1
- low confidence evidence: score of 0. This means that we show the flag, but the evidence is not weighted in the V2G aggregated score
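
The two steps above can be sketched as follows. This assumes the LOFTEE flag arrives as the string `"HC"` (high confidence) or `"LC"` (low confidence) and that any other value means the variant is not eligible; both the encoding and the function name are assumptions for illustration.

```python
def loftee_evidence(variant_id, gene_id, lof_flag):
    """Build a V2G evidence record from a LOFTEE confidence flag, or None."""
    if lof_flag not in ("HC", "LC"):
        # Variant is not eligible for a pLOF assessment: no evidence is produced.
        return None
    is_high_quality = lof_flag == "HC"
    return {
        "variantId": variant_id,
        "geneId": gene_id,
        "datatypeId": "vep",
        "datasourceId": "loftee",
        # High confidence scores 1; low confidence scores 0, so the flag is
        # kept but adds nothing to the aggregated V2G score.
        "score": 1 if is_high_quality else 0,
        "isHighQualityPlof": is_high_quality,
    }


# Identifiers taken from the example evidence in this comment; the "LC" input
# is inferred from its isHighQualityPlof value of false.
evidence = loftee_evidence("2_159714599_G_A", "ENSG00000136536", "LC")
print(evidence["score"], evidence["isHighQualityPlof"])
```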

Example of a V2G evidence:

 geneId                         | ENSG00000136536
 resourceScore                  | null
 datasourceId                   | loftee
 datatypeId                     | vep
 pmid                           | null
 biofeature                     | null
 score                          | 0
 variantId                      | 2_159714599_G_A
 label                          | null
 variantFunctionalConsequenceId | null
 isHighQualityPlof              | false
 chromosome                     | 2