opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Variant index data for backend integration #3350

Open DSuveges opened 1 week ago

DSuveges commented 1 week ago

The new variant annotation data is available:

Some stats:

Schema:

root
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- variantConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- amino_acid_change: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = true)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distance: long (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |-- alleleFrequencies: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)

Resolvable entities

In this dataset the following fields can be resolved to existing objects:

Other

For @gjmcn: some variants can have five different cross-references:

[
  {
    "id": "rs1801253",
    "source": "ensemblVariation"
  },
  {
    "id": "604878#0009",
    "source": "omim"
  },
  {
    "id": "VCV000017746",
    "source": "clinVar"
  },
  {
    "id": "10_114045297_G_C",
    "source": "protVar"
  },
  {
    "id": "10-114045297-G-C",
    "source": "gnomad"
  }
]
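In the example above, the ProtVar and gnomAD identifiers carry the same four components and differ only in the delimiter. A minimal sketch of converting between the two forms, assuming identifiers always follow the `chromosome_position_ref_alt` pattern seen in this thread (the helper names `parse_variant_id` and `to_gnomad_id` are hypothetical and not part of the pipeline):

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    chromosome: str
    position: int
    reference_allele: str
    alternate_allele: str


def parse_variant_id(variant_id: str) -> VariantKey:
    """Split an underscore-delimited id such as '10_114045297_G_C'."""
    # Split from the right so a chromosome or contig name containing
    # underscores stays in one piece (an assumption about the data).
    parts = variant_id.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected variant id format: {variant_id!r}")
    chromosome, position, ref, alt = parts
    return VariantKey(chromosome, int(position), ref, alt)


def to_gnomad_id(variant_id: str) -> str:
    """Return the dash-delimited form used in the gnomad cross-reference."""
    key = parse_variant_id(variant_id)
    return "-".join(
        [key.chromosome, str(key.position), key.reference_allele, key.alternate_allele]
    )
```

For instance, `to_gnomad_id("10_114045297_G_C")` yields the `10-114045297-G-C` form shown in the `gnomad` entry.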

Link generation:

DSuveges commented 1 week ago

@remo87, a new variant dataset has been created here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/test_variant_index

Please use this one, as it was built with the most recent code and via the actual variant index pipeline implementation. The only notable difference is that gnomAD data is used to source other in-silico predictors (spliceai and pangolin) that we cannot get from VEP.

  "inSilicoPredictors": [
    {
      "method": "phred scaled CADD",
      "score": 1.44,
      "targetId": "ENSG00000183495"
    },
    {
      "method": "spliceai",
      "score": 0.0
    },
    {
      "method": "pangolin",
      "score": 0.0
    }
  ],
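Because `assessment`, `assessmentFlag`, `targetId`, and even `score` can be absent for a given predictor, a consumer probably wants to read this array defensively. A minimal sketch, assuming each element is a plain dictionary shaped like the fragment above (the function name `scores_by_method` is hypothetical):

```python
from typing import Dict, List, Optional


def scores_by_method(predictors: Optional[List[Optional[dict]]]) -> Dict[str, Optional[float]]:
    """Map each predictor method to its score, tolerating missing values."""
    scores: Dict[str, Optional[float]] = {}
    for predictor in predictors or []:
        if predictor is None:
            continue  # the schema allows null elements in the array
        method = predictor.get("method")
        if method is None:
            continue  # nothing to key the score on
        scores[method] = predictor.get("score")  # None when the score is missing
    return scores


example = [
    {"method": "phred scaled CADD", "score": 1.44, "targetId": "ENSG00000183495"},
    {"method": "spliceai", "score": 0.0},
    {"method": "pangolin", "score": 0.0},
]
print(scores_by_method(example))
```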

remo87 commented 2 days ago

@DSuveges @d0choa I see in the schema that all the fields are marked as nullable. Is this the case? Are all fields other than the variant id optional?

DSuveges commented 2 days ago

The following fields are expected to be mandatory:

 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)

The consequence term is always computable, so it must be there. The genomic location and alleles are also expected to be present.
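Since the Spark schema reports every column as nullable, the mandatory fields have to be enforced by the consumer. A minimal sketch of such a check, assuming a record is available as a plain dictionary keyed by the schema field names (`MANDATORY_FIELDS` and `missing_mandatory_fields` are hypothetical names):

```python
from typing import List

# The six fields listed above as expected to be mandatory.
MANDATORY_FIELDS = (
    "variantId",
    "chromosome",
    "position",
    "referenceAllele",
    "alternateAllele",
    "mostSevereConsequenceId",
)


def missing_mandatory_fields(record: dict) -> List[str]:
    """Return the mandatory fields that are absent or null in a record."""
    return [field for field in MANDATORY_FIELDS if record.get(field) is None]


record = {
    "variantId": "11_108288992_T_C",
    "chromosome": "11",
    "position": 108288992,
    "referenceAllele": "T",
    "alternateAllele": "C",
    "mostSevereConsequenceId": None,  # deliberately null for illustration
}
print(missing_mandatory_fields(record))
```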

remo87 commented 1 day ago

I just finished implementing the change. I'm going to deploy it to dev so that it can be tested.

remo87 commented 14 hours ago

The new environment is deployed at variant.dev.opentargets.xyz. This is a sample query:

query {
  variant(variantId:"11_108288992_T_C"){
    variantId
    chromosome
    position
    referenceAllele
    alternateAllele
    inSilicoPredictors{
      method
      assessment
      score
      assessmentFlag
      targetId
    }
    mostSevereConsequenceId
    transcriptConsequences {
      variantConsequenceIds
      amino_acid_change
      uniprotAccessions
      isEnsemblCanonical
      codons
      distance
      targetId
      impact
      transcriptId
      lofteePrediction
      siftPrediction
      polyphenPrediction
    }
    rsIds
    dbXrefs {
      id
      source
    }
    alleleFrequencies {
      populationName
      alleleFrequency
    }
  }
}
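For testing outside the playground, the same kind of query can be sent as an HTTP POST. A minimal sketch using only the standard library; the endpoint path `/api/v4/graphql`, the `String!` variable type, and the helper names are assumptions that the thread does not confirm, and the field selection is shortened for brevity:

```python
import json
import urllib.request

# Assumption: the thread gives only the host, not the GraphQL endpoint path.
GRAPHQL_URL = "https://variant.dev.opentargets.xyz/api/v4/graphql"

# Assumption: variantId is declared as a non-null String argument.
QUERY = """
query VariantById($variantId: String!) {
  variant(variantId: $variantId) {
    variantId
    mostSevereConsequenceId
    rsIds
  }
}
"""


def build_request(variant_id: str, url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for one variant."""
    payload = json.dumps(
        {"query": QUERY, "variables": {"variantId": variant_id}}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_variant(variant_id: str, url: str = GRAPHQL_URL) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(variant_id, url), timeout=30) as response:
        return json.load(response)
```

Calling `fetch_variant("11_108288992_T_C")` would then return the decoded response body, provided the assumed endpoint is correct and reachable.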

d0choa commented 13 hours ago

The API fails to return a fraction of the variants in the dataset. This is an example of a missing variant:

-RECORD 0---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 variantId               | 1_209770915_G_T
 alleleFrequencies       | [{afr_adj, 3.897817222355481E-4}, {ami_adj, 0.0}, {amr_adj, 4.8348106365834006E-4}, {asj_adj, 0.0}, {eas_adj, 0.0013178703215603585}, {fin_adj, 0.0012318642211880647}, {mid_adj, 0.0}, {nfe_adj, 1.335247187635611E-4}, {remaining_adj, 0.0}, {sas_adj, 0.0014212620807276862}]
 position                | 209770915
 referenceAllele         | G
 alternateAllele         | T
 inSilicoPredictors      | [{phred scaled CADD, null, 0.448, null, ENSG00000009790}, {spliceai, null, 0.0, null, null}, {pangolin, null, null, null, null}]
 mostSevereConsequenceId | SO_0001627
 transcriptConsequences  | [{[SO_0001627], null, [Q9Y228], true, null, null, ENSG00000009790, MODIFIER, ENST00000367025, null, null, null}]
 rsIds                   | [rs1468069866]
 dbXrefs                 | [{rs1468069866, ensemblVariation}, {1-209770915-G-T, gnomad}]
only showing top 1 row
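To gauge how many variants are affected, the identifiers in the dataset could be checked against the API one by one. A minimal sketch, assuming a caller-supplied `lookup` function that returns `None` when the API has no record for an identifier (both `find_missing_variants` and `lookup` are hypothetical names):

```python
from typing import Callable, Iterable, List, Optional


def find_missing_variants(
    variant_ids: Iterable[str],
    lookup: Callable[[str], Optional[dict]],
) -> List[str]:
    """Return the ids for which the lookup yields no record."""
    return [variant_id for variant_id in variant_ids if lookup(variant_id) is None]


# Stand-in for the API: only one of the two ids from this thread is known.
known = {"11_108288992_T_C": {"variantId": "11_108288992_T_C"}}
print(find_missing_variants(["11_108288992_T_C", "1_209770915_G_T"], known.get))
```

Passing the lookup as a parameter keeps the comparison independent of how the API is actually queried.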
