Variant index data for backend integration

DSuveges commented 1 week ago

The new variant annotation data is available:

~~JSON: gs://ot-team/dsuveges/variant_index_new_json (7.72 GiB)~~
~~Parquet: gs://ot-team/dsuveges/variant_index_new (1.01 GiB)~~
New data available here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/test_variant_index

Some stats:

The dataset contains 7.09M variants.
5.8M of them have allele frequencies (81%)
615k have no in-silico predictors (8.6%)
484k have no cross-references (6.8%)

Schema:

root
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- variantConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- amino_acid_change: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = true)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distance: long (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |-- alleleFrequencies: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)

Resolvable entities

In this dataset the following fields can be resolved to existing objects:

transcriptConsequences.variantConsequenceIds -> sequence ontology terms.
mostSevereConsequenceId -> sequence ontology term.
transcriptConsequences.targetId -> target identifier.
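
A minimal sketch of how the resolvable identifiers listed above could be collected from a single variant record before looking them up. It assumes the record is a plain Python dict shaped like the schema; the function name `resolvable_ids` and the output keys are hypothetical, and the actual resolution is left to the backend.

```python
def resolvable_ids(variant):
    """Collect the identifiers of one variant record that can be resolved.

    Only the three fields named in this issue are considered:
    mostSevereConsequenceId, transcriptConsequences.variantConsequenceIds
    and transcriptConsequences.targetId.
    """
    so_terms = set()
    target_ids = set()

    most_severe = variant.get("mostSevereConsequenceId")
    if most_severe:
        so_terms.add(most_severe)

    # Every field in the schema is nullable, so guard against None everywhere.
    for consequence in variant.get("transcriptConsequences") or []:
        so_terms.update(consequence.get("variantConsequenceIds") or [])
        target_id = consequence.get("targetId")
        if target_id:
            target_ids.add(target_id)

    # Output keys are illustrative names, not part of the dataset.
    return {
        "sequenceOntologyTerms": sorted(so_terms),
        "targetIds": sorted(target_ids),
    }
```

Deduplicating per variant keeps the number of lookups small when many transcripts share the same consequence term or target.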

Other

For @gjmcn, some variants can have 5 different cross-references:

[
  {
    "id": "rs1801253",
    "source": "ensemblVariation"
  },
  {
    "id": "604878#0009",
    "source": "omim"
  },
  {
    "id": "VCV000017746",
    "source": "clinVar"
  },
  {
    "id": "10_114045297_G_C",
    "source": "protVar"
  },
  {
    "id": "10-114045297-G-C",
    "source": "gnomad"
  }
]

Link generation:

ensemblVariation: https://www.ensembl.org/Homo_sapiens/Variation/Explore?v={id}
gnomad: https://gnomad.broadinstitute.org/variant/{id}?dataset=gnomad_r4
protVar: needs parsing - https://www.ebi.ac.uk/ProtVar/query?chromosome={chr}&genomic_position={pos}&reference_allele={ref}&alternative_allele={alt}
clinVar: https://www.ncbi.nlm.nih.gov/clinvar/variation/{id}/
omim: https://www.omim.org/entry/{id}
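
The link templates above could be applied roughly as follows. This is only a sketch: the helper name `xref_to_url` is hypothetical, and the protVar parsing assumes the identifier always has the `chromosome_position_ref_alt` shape shown in the example.

```python
from urllib.parse import urlencode

# URL templates copied from the list above; keys match dbXrefs.source values.
URL_TEMPLATES = {
    "ensemblVariation": "https://www.ensembl.org/Homo_sapiens/Variation/Explore?v={id}",
    "gnomad": "https://gnomad.broadinstitute.org/variant/{id}?dataset=gnomad_r4",
    "clinVar": "https://www.ncbi.nlm.nih.gov/clinvar/variation/{id}/",
    "omim": "https://www.omim.org/entry/{id}",
}
PROTVAR_BASE = "https://www.ebi.ac.uk/ProtVar/query"


def xref_to_url(xref):
    """Return the external link for one dbXrefs element, or None if the source is unknown."""
    source, xref_id = xref["source"], xref["id"]

    if source == "protVar":
        # Assumption: protVar ids look like "10_114045297_G_C".
        parts = xref_id.split("_")
        if len(parts) != 4:
            raise ValueError(f"Unexpected protVar id: {xref_id!r}")
        chrom, pos, ref, alt = parts
        params = {
            "chromosome": chrom,
            "genomic_position": pos,
            "reference_allele": ref,
            "alternative_allele": alt,
        }
        return f"{PROTVAR_BASE}?{urlencode(params)}"

    template = URL_TEMPLATES.get(source)
    if template is None:
        return None
    return template.format(id=xref_id)
```

For example, `xref_to_url({"id": "10-114045297-G-C", "source": "gnomad"})` gives `https://gnomad.broadinstitute.org/variant/10-114045297-G-C?dataset=gnomad_r4`. The omim id is inserted unescaped on purpose, so the `#0009` part ends up as the URL fragment.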

DSuveges commented 1 week ago

@remo87, a new variant dataset has been created here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/test_variant_index

Please use this one, as it was built with the most recent code and via the actual variant index pipeline implementation. The only notable difference is that gnomAD data is used to source the in-silico predictors (spliceai and pangolin) that we cannot get from VEP.

  "inSilicoPredictors": [
    {
      "method": "phred scaled CADD",
      "score": 1.44,
      "targetId": "ENSG00000183495"
    },
    {
      "method": "spliceai",
      "score": 0.0
    },
    {
      "method": "pangolin",
      "score": 0.0
    }
  ],

remo87 commented 2 days ago

@DSuveges @d0choa I see in the schema that all the fields are marked as nullable. Is this the case? Are all fields other than the variant id optional?

DSuveges commented 2 days ago

The following fields are expected to be mandatory:

 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)

The consequence term is always computable, so it must be present. The genomic location and alleles are also expected to be present.
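
Since the Parquet schema still reports every column as nullable, the backend may want to check the mandatory fields itself. A minimal sketch, assuming records are available as Python dicts; the helper name `missing_mandatory_fields` is hypothetical.

```python
# The six fields listed above as mandatory.
MANDATORY_FIELDS = (
    "variantId",
    "chromosome",
    "position",
    "referenceAllele",
    "alternateAllele",
    "mostSevereConsequenceId",
)


def missing_mandatory_fields(record):
    """Return the mandatory fields that are absent or null in a variant record."""
    return [field for field in MANDATORY_FIELDS if record.get(field) is None]
```

An empty list means the record satisfies the expectation; anything else names the fields to investigate.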

remo87 commented 1 day ago

I have just finished implementing the change. I'm going to deploy it to dev so that it can be tested.

remo87 commented 14 hours ago

The new environment is deployed at variant.dev.opentargets.xyz. This is a sample query:

query {
  variant(variantId:"11_108288992_T_C"){
    variantId
    chromosome
    position
    referenceAllele
    alternateAllele
    inSilicoPredictors{
      method
      assessment
      score
      assessmentFlag
      targetId
    }
    mostSevereConsequenceId
    transcriptConsequences {
      variantConsequenceIds
      amino_acid_change
      uniprotAccessions
      isEnsemblCanonical
      codons
      distance
      targetId
      impact
      transcriptId
      lofteePrediction
      siftPrediction
      polyphenPrediction
    }
    rsIds
    dbXrefs {
      id
      source
    }
    alleleFrequencies {
      populationName
      alleleFrequency
    }
  }
}
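
For testing outside the playground, a query like the one above can be sent over plain HTTP. The sketch below uses only the Python standard library; the function names are hypothetical, only a subset of the fields is requested, and the endpoint must be passed in by the caller because the exact GraphQL path on the dev host is not stated in this thread.

```python
import json
import urllib.request


def build_variant_payload(variant_id):
    """Build the JSON body for a GraphQL request fetching one variant."""
    # json.dumps yields a double-quoted, escaped string literal for the id.
    query = """
    query {
      variant(variantId: %s) {
        variantId
        chromosome
        position
        referenceAllele
        alternateAllele
        mostSevereConsequenceId
        dbXrefs { id source }
      }
    }
    """ % json.dumps(variant_id)
    return {"query": query}


def post_graphql(endpoint, payload, timeout=30):
    """POST a GraphQL payload and return the decoded JSON response.

    `endpoint` is the full GraphQL URL of the deployment (an assumption the
    caller has to supply; it is not given in this issue).
    """
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Usage would look like `post_graphql(endpoint, build_variant_payload("11_108288992_T_C"))`, with `endpoint` pointing at the GraphQL route of variant.dev.opentargets.xyz.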

d0choa commented 13 hours ago

The API does not return a fraction of the variants in the dataset. This is an example of a missing variant:

-RECORD 0---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 variantId               | 1_209770915_G_T
 alleleFrequencies       | [{afr_adj, 3.897817222355481E-4}, {ami_adj, 0.0}, {amr_adj, 4.8348106365834006E-4}, {asj_adj, 0.0}, {eas_adj, 0.0013178703215603585}, {fin_adj, 0.0012318642211880647}, {mid_adj, 0.0}, {nfe_adj, 1.335247187635611E-4}, {remaining_adj, 0.0}, {sas_adj, 0.0014212620807276862}]
 position                | 209770915
 referenceAllele         | G
 alternateAllele         | T
 inSilicoPredictors      | [{phred scaled CADD, null, 0.448, null, ENSG00000009790}, {spliceai, null, 0.0, null, null}, {pangolin, null, null, null, null}]
 mostSevereConsequenceId | SO_0001627
 transcriptConsequences  | [{[SO_0001627], null, [Q9Y228], true, null, null, ENSG00000009790, MODIFIER, ENST00000367025, null, null, null}]
 rsIds                   | [rs1468069866]
 dbXrefs                 | [{rs1468069866, ensemblVariation}, {1-209770915-G-T, gnomad}]
only showing top 1 row
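
To get a list of affected variants rather than a single example, something along these lines could be run against a sample of variant ids from the dataset. It is a sketch only: it assumes the API answers an unknown variant with `"variant": null` in the GraphQL `data` object, and `fetch_response` is a hypothetical callable that takes a variant id and returns the decoded JSON response.

```python
def is_missing(response):
    """True if a GraphQL response carries no variant object.

    Assumption: missing variants come back as {"data": {"variant": null}}.
    """
    data = response.get("data") or {}
    return data.get("variant") is None


def find_missing_variants(variant_ids, fetch_response):
    """Return the ids for which the API returned no variant.

    `fetch_response` is a caller-supplied function: variant id -> decoded JSON.
    """
    return [vid for vid in variant_ids if is_missing(fetch_response(vid))]
```

Passing the fetch function in keeps the comparison independent of how the API is actually called.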
