Open DSuveges opened 1 week ago
@remo87, a new variant dataset has been created here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/test_variant_index
Please use this one, as it was built with the most recent code via the actual variant index pipeline implementation. The only notable difference is that GnomAD data is used to source the other in-silico predictors (spliceai and pangolin) that we cannot get from VEP.
"inSilicoPredictors": [
  {
    "method": "phred scaled CADD",
    "score": 1.44,
    "targetId": "ENSG00000183495"
  },
  {
    "method": "spliceai",
    "score": 0.0
  },
  {
    "method": "pangolin",
    "score": 0.0
  }
],
@DSuveges @d0choa I see in the schema that all the fields are marked as nullable. Is this the case? Are all fields other than the variant id optional?
The following fields are expected to be mandatory:
|-- variantId: string (nullable = true)
|-- chromosome: string (nullable = true)
|-- position: integer (nullable = true)
|-- referenceAllele: string (nullable = true)
|-- alternateAllele: string (nullable = true)
|-- mostSevereConsequenceId: string (nullable = true)
The consequence term is always computable, so it must be present. The genomic location and the alleles are also expected to be present.
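To make the expectation above checkable, here is a minimal sketch that reports which of the six mandatory fields are absent or null in a single record. It assumes records are available as plain dictionaries (for example, one parsed line of the JSON export); the helper name `missing_mandatory_fields` and the example record are illustrative only and not part of the pipeline.

```python
from typing import Any, Mapping

# Fields listed above as expected to be mandatory.
MANDATORY_FIELDS = (
    "variantId",
    "chromosome",
    "position",
    "referenceAllele",
    "alternateAllele",
    "mostSevereConsequenceId",
)


def missing_mandatory_fields(record: Mapping[str, Any]) -> list[str]:
    """Return the mandatory fields that are absent or null in one variant record."""
    return [field for field in MANDATORY_FIELDS if record.get(field) is None]


# Illustrative record (hypothetical): the consequence is deliberately left null.
example = {
    "variantId": "11_108288992_T_C",
    "chromosome": "11",
    "position": 108288992,
    "referenceAllele": "T",
    "alternateAllele": "C",
    "mostSevereConsequenceId": None,
}

print(missing_mandatory_fields(example))
```

Running a check like this over the whole dataset would show whether the `nullable = true` flags reflect real nulls or are only an artefact of how the schema was written.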
I have just finished implementing the change. I'm going to deploy it to dev so that it can be tested.
The new environment is deployed at variant.dev.opentargets.xyz. This is a sample query:
query {
  variant(variantId: "11_108288992_T_C") {
    variantId
    chromosome
    position
    referenceAllele
    alternateAllele
    inSilicoPredictors {
      method
      assessment
      score
      assessmentFlag
      targetId
    }
    mostSevereConsequenceId
    transcriptConsequences {
      variantConsequenceIds
      amino_acid_change
      uniprotAccessions
      isEnsemblCanonical
      codons
      distance
      targetId
      impact
      transcriptId
      lofteePrediction
      siftPrediction
      polyphenPrediction
    }
    rsIds
    dbXrefs {
      id
      source
    }
    alleleFrequencies {
      populationName
      alleleFrequency
    }
  }
}
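For testing from a script rather than the playground, the query can be sent as a standard GraphQL POST. The sketch below is only a starting point under stated assumptions: the thread names the host but not the endpoint path, so `GRAPHQL_URL` is a guess, and the `String!` type of the `variantId` argument is also assumed. Only a subset of the fields from the sample query is requested to keep it short.

```python
import json
import urllib.request

# ASSUMPTION: the thread only gives the host; the path below is hypothetical.
GRAPHQL_URL = "https://variant.dev.opentargets.xyz/api/v4/graphql"

# ASSUMPTION: the variantId argument is declared as String! in the API schema.
VARIANT_QUERY = """
query Variant($variantId: String!) {
  variant(variantId: $variantId) {
    variantId
    chromosome
    position
    referenceAllele
    alternateAllele
    mostSevereConsequenceId
  }
}
"""


def build_request(variant_id: str, url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST request for one variant."""
    payload = json.dumps(
        {"query": VARIANT_QUERY, "variables": {"variantId": variant_id}}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_variant(variant_id: str) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(variant_id), timeout=30) as response:
        return json.load(response)
```

Calling `fetch_variant("11_108288992_T_C")` would return the usual GraphQL envelope with a `data` key, provided the assumed endpoint path is right.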
The API fails to return some of the variants that are present in the dataset. This is an example of a missing variant:
-RECORD 0------------------------------------------------------------
variantId | 1_209770915_G_T
alleleFrequencies | [{afr_adj, 3.897817222355481E-4}, {ami_adj, 0.0}, {amr_adj, 4.8348106365834006E-4}, {asj_adj, 0.0}, {eas_adj, 0.0013178703215603585}, {fin_adj, 0.0012318642211880647}, {mid_adj, 0.0}, {nfe_adj, 1.335247187635611E-4}, {remaining_adj, 0.0}, {sas_adj, 0.0014212620807276862}]
position | 209770915
referenceAllele | G
alternateAllele | T
inSilicoPredictors | [{phred scaled CADD, null, 0.448, null, ENSG00000009790}, {spliceai, null, 0.0, null, null}, {pangolin, null, null, null, null}]
mostSevereConsequenceId | SO_0001627
transcriptConsequences | [{[SO_0001627], null, [Q9Y228], true, null, null, ENSG00000009790, MODIFIER, ENST00000367025, null, null, null}]
rsIds | [rs1468069866]
dbXrefs | [{rs1468069866, ensemblVariation}, {1-209770915-G-T, gnomad}]
only showing top 1 row
The new variant annotation data is available:
JSON: gs://ot-team/dsuveges/variant_index_new_json (7.72 GiB)
Parquet: gs://ot-team/dsuveges/variant_index_new (1.01 GiB)
gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/test_variant_index
Some stats:
Schema:
Resolvable entities
In this dataset the following fields can be resolved to existing objects:
transcriptConsequences.variantConsequenceIds -> sequence ontology terms.
mostSevereConsequenceId -> sequence ontology term.
transcriptConsequences.targetId -> target identifier.
Other
For @gjmcn: some variants can have 5 different cross-references:
Link generation:
https://www.ensembl.org/Homo_sapiens/Variation/Explore?v={id}
https://gnomad.broadinstitute.org/variant/{id}?dataset=gnomad_r4
https://www.ebi.ac.uk/ProtVar/query?chromosome={chr}&genomic_position={pos}&reference_allele={ref}&alternative_allele={alt}
https://www.ncbi.nlm.nih.gov/clinvar/variation/{id}/
https://www.omim.org/entry/{id}
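The templates above could be applied to `dbXrefs` entries roughly as follows. This is a sketch, not the front-end implementation: the source keys `ensemblVariation` and `gnomad` appear in the example record earlier in the thread, while `clinvar`, `omim` and `protvar` are assumed key names. It also assumes the variant id always has the form `chromosome_position_ref_alt`, which is what the ProtVar link is built from.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode

# Templates copied from the list above.
# ASSUMPTION: "clinvar" and "omim" are hypothetical source names;
# only "ensemblVariation" and "gnomad" are seen in the example record.
ID_TEMPLATES = {
    "ensemblVariation": "https://www.ensembl.org/Homo_sapiens/Variation/Explore?v={id}",
    "gnomad": "https://gnomad.broadinstitute.org/variant/{id}?dataset=gnomad_r4",
    "clinvar": "https://www.ncbi.nlm.nih.gov/clinvar/variation/{id}/",
    "omim": "https://www.omim.org/entry/{id}",
}

# ASSUMPTION: "protvar" is a hypothetical source name.
PROTVAR_SOURCE = "protvar"
PROTVAR_BASE = "https://www.ebi.ac.uk/ProtVar/query"


def xref_url(xref: Mapping[str, str], variant_id: str) -> Optional[str]:
    """Return the external link for one dbXrefs entry, or None if the source is unknown."""
    source = xref["source"]
    if source == PROTVAR_SOURCE:
        # ProtVar is addressed by genomic coordinates rather than by an identifier.
        chrom, pos, ref, alt = variant_id.split("_")
        params = {
            "chromosome": chrom,
            "genomic_position": pos,
            "reference_allele": ref,
            "alternative_allele": alt,
        }
        return f"{PROTVAR_BASE}?{urlencode(params)}"
    template = ID_TEMPLATES.get(source)
    return template.format(id=xref["id"]) if template else None


print(xref_url({"id": "rs1468069866", "source": "ensemblVariation"}, "1_209770915_G_T"))
print(xref_url({"id": "1-209770915-G-T", "source": "gnomad"}, "1_209770915_G_T"))
```

Returning `None` for an unrecognised source lets the caller decide whether to hide the cross-reference or show the bare identifier.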