Prepare data export from OT Genetics Portal for Ensembl VEP - schema definition

DSuveges commented 1 year ago

Open Targets' locus-to-gene (l2g) prediction can tell VEP users whether a queried variant is part of a GWAS locus and whether that locus can be linked to a gene.

The scope of this ticket includes:

[ ] Solidify the expected data format - do we need to wrap study/disease info?
[ ] Solidify the output schema so downstream work on the VEP team's side is unblocked.
[ ] Prototype the logic; a productionized version of the code is not expected, given the ongoing efforts on the Genetics Portal.

Formal requirements:

No schema in place to comply with.
Variants are expected to be on GRCh38 (chr, pos, alt/ref alleles)
Genes are identified by their Ensembl stable id.
To avoid losing associations because of obsolete identifiers, gene symbols are also added (based on the Platform's target index from the corresponding Ensembl version).
No specific format is defined (json? parquet?)
No OT url is expected to be prepped if the link can be generated from components.
Recommended output: bgzipped, tabixed .tsv file.
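
As an illustration of "the link can be generated from components", here is a minimal sketch of how a consumer could build a variant identifier and link from the four variant columns. The `chr_pos_ref_alt` identifier layout and the URL prefix are assumptions for illustration, not something this ticket specifies; the function names are hypothetical.

```python
# Assumption: variant pages are addressed by a chr_pos_ref_alt identifier
# under this prefix. Neither is defined in this ticket.
ASSUMED_VARIANT_URL_PREFIX = "https://genetics.opentargets.org/variant"


def variant_id(chromosome: str, position: int, ref: str, alt: str) -> str:
    """Join the four variant components into a single identifier."""
    return f"{chromosome}_{position}_{ref}_{alt}"


def variant_url(chromosome: str, position: int, ref: str, alt: str) -> str:
    """Build a link to the variant page from the exported columns."""
    return f"{ASSUMED_VARIANT_URL_PREFIX}/{variant_id(chromosome, position, ref, alt)}"
```

Because the link is fully derived from columns already present in the export, storing it would only duplicate information.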

Starting points:

locus/gene pairs returned only if the l2g score is above 0.5
l2g prediction is exploded to all variants of the credible set.
As a first iteration, we are only returning gene/variant pairs, which allows linking to OT's variant page. The variant page shows all linked studies/loci.

DSuveges commented 1 year ago

Process:

Identifying input datasets: locus2gene, finemapping credible sets, ld-expanded credible sets, target index.
Process the l2g dataset: get study (study ID), locus (lead variant ID), gene (Ensembl gene ID) triplets with l2g score >= 0.5.
Read the target index and join gene symbol and name to the gene identifier (verify that all gene IDs are mapped).
Get all tag variants from the credible sets for each study/locus pair: from the fine-mapped dataset if available, otherwise from the LD-expanded dataset (a left-anti join was performed to drop all fine-mapped studies from the LD-expanded dataset).
Credible sets are joined with the l2g table by study ID and lead variant ID (verify that no study/lead pair is lost).
For each credible set variant, we keep all associated genes, each with its highest l2g score.
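
The steps above can be sketched in plain Python on small in-memory inputs. This is not the prototype from the gist: the function name, the input shapes (tuples and dicts) and the `variantId` output field are assumptions chosen for illustration, and the last step is read as "keep the highest l2g score per variant/gene pair".

```python
L2G_THRESHOLD = 0.5  # threshold taken from the process description


def build_variant_gene_table(l2g_rows, finemapped, ld_expanded, target_index):
    """Hypothetical helper mirroring the described process.

    l2g_rows:     iterable of (study_id, lead_variant_id, gene_id, l2g_score)
    finemapped:   dict {(study_id, lead_variant_id): [tag_variant_id, ...]}
    ld_expanded:  dict with the same shape as `finemapped`
    target_index: dict {gene_id: (gene_symbol, gene_name)}
    """
    # Left-anti join: drop LD-expanded entries of studies that are fine-mapped.
    finemapped_studies = {study for study, _ in finemapped}
    credible_sets = dict(finemapped)
    for (study, lead), tags in ld_expanded.items():
        if study not in finemapped_studies:
            credible_sets[(study, lead)] = tags

    # Explode each study/locus/gene triplet over its tag variants and keep
    # the highest score seen for every (variant, gene) pair.
    best = {}
    for study, lead, gene, score in l2g_rows:
        if score < L2G_THRESHOLD:
            continue
        for tag in credible_sets.get((study, lead), []):
            key = (tag, gene)
            if score > best.get(key, float("-inf")):
                best[key] = score

    # Attach gene symbol and name; unmapped genes get None so they can be checked.
    rows = []
    for (tag, gene), score in sorted(best.items()):
        symbol, name = target_index.get(gene, (None, None))
        rows.append({"variantId": tag, "geneId": gene, "geneSymbol": symbol,
                     "geneName": name, "l2g": score})
    return rows
```

A row with `geneSymbol` set to `None` would flag a gene ID missing from the target index, which corresponds to the mapping check mentioned in the steps.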

Gist prototyping exported dataset.

Output 1:

Completely exploded table: each variant is repeated once for every gene it is associated with.

Schema:

root
 |-- chromosome: string (nullable = true)
 |-- position: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- geneId: string (nullable = true)
 |-- geneSymbol: string (nullable = true)
 |-- geneName: string (nullable = true)
 |-- l2g: double (nullable = true)

Example:

-RECORD 0------------------------------
 chromosome      | 1                   
 position        | 196717788           
 referenceAllele | G                   
 alternateAllele | A                   
 geneId          | ENSG00000000971     
 geneSymbol      | CFH                 
 geneName        | complement factor H 
 l2g             | 0.5803238749504089  
only showing top 1 row

Output 2:

Data aggregated by variant: each variant in the table is unique, and a list of associated gene objects containing the gene data is added.

Schema:

root
 |-- chromosome: string (nullable = true)
 |-- position: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- associatedGenes: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- geneId: string (nullable = true)
 |    |    |-- geneSymbol: string (nullable = true)
 |    |    |-- geneName: string (nullable = true)
 |    |    |-- l2g: double (nullable = true)

Example:

-RECORD 0------------------------------------------------------------------------------------
 chromosome      | 1                                                                         
 position        | 1222304                                                                   
 referenceAllele | C                                                                         
 alternateAllele | T                                                                         
 associatedGenes | [{ENSG00000184163, C1QTNF12, C1q and TNF related 12, 0.5497661828994751}] 
only showing top 1 row
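
The relationship between the two outputs can be sketched as a group-by over the exploded rows. This is an illustration in plain Python, assuming rows are dicts keyed by the Output 1 column names; `aggregate_by_variant` is a hypothetical helper, not code from the gist.

```python
from collections import defaultdict

VARIANT_FIELDS = ("chromosome", "position", "referenceAllele", "alternateAllele")
GENE_FIELDS = ("geneId", "geneSymbol", "geneName", "l2g")


def aggregate_by_variant(exploded_rows):
    """Collapse Output 1 style rows into one Output 2 style row per variant."""
    grouped = defaultdict(list)
    for row in exploded_rows:
        key = tuple(row[field] for field in VARIANT_FIELDS)
        grouped[key].append({field: row[field] for field in GENE_FIELDS})
    return [
        {**dict(zip(VARIANT_FIELDS, key)), "associatedGenes": genes}
        for key, genes in grouped.items()
    ]
```

Both layouts carry the same information; the aggregated form trades flat, tabix-friendly rows for a single record per variant.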

Stats

Number of variant to gene evidence: 1_309_708
Number of unique variants: 1_123_519
Number of unique genes: 11_177
Number of variants with multiple genes associated: 152_973
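
For reference, these four figures can be derived from the exploded table alone. The sketch below assumes a list of dicts keyed by the Output 1 column names and at most one row per variant/gene pair; `summarise` and its result keys are hypothetical names.

```python
from collections import defaultdict


def summarise(exploded_rows):
    """Compute the four summary counts from a list of Output 1 style rows."""
    genes_per_variant = defaultdict(set)
    for row in exploded_rows:
        variant = (row["chromosome"], row["position"],
                   row["referenceAllele"], row["alternateAllele"])
        genes_per_variant[variant].add(row["geneId"])
    return {
        "evidence": len(exploded_rows),
        "unique_variants": len(genes_per_variant),
        "unique_genes": len(set().union(*genes_per_variant.values())),
        "multi_gene_variants": sum(
            1 for genes in genes_per_variant.values() if len(genes) > 1
        ),
    }
```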

DSuveges commented 1 year ago

Feedback from the Ensembl team on what we should change:

Rename the file to end in .gz instead of .bz (renaming is enough)
Run tabix in a way to include the file header: tabix -f -b2 -e2 -c chromosome OTGenetics.tsv.gz
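
Tabix indexing needs the rows grouped by chromosome and sorted by position, with the position in the column given to -b/-e (column 2 above). A minimal sketch of preparing such a TSV from Output 1 style rows follows; the column order matches the Output 1 schema, while `to_tsv` and the sort key are hypothetical. Compression is left to the `bgzip` command line tool, since Python's `gzip` module does not write the blocked format tabix expects.

```python
import csv
import io

# Column order follows the Output 1 schema; the header line starts with "chromosome".
COLUMNS = ["chromosome", "position", "referenceAllele", "alternateAllele",
           "geneId", "geneSymbol", "geneName", "l2g"]


def _sort_key(row):
    """Numeric chromosomes first (in numeric order), then the rest alphabetically."""
    chrom = str(row["chromosome"])
    if chrom.isdigit():
        return (0, int(chrom), "", int(row["position"]))
    return (1, 0, chrom, int(row["position"]))


def to_tsv(rows):
    """Return the rows as tab-separated text with a header, sorted for indexing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=_sort_key):
        writer.writerow(row)
    return buffer.getvalue()
```

The resulting text would be written to a file such as OTGenetics.tsv, compressed with bgzip, and then indexed with the tabix command quoted above.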

d0choa commented 1 month ago

do we close this @DSuveges?

DSuveges commented 1 month ago

Yes. I think we can consider the schema finished. We'll open a ticket to set up the new process.
