opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Prepare data export from OT Genetics Portal for Ensembl VEP - schema definition #2984

Closed DSuveges closed 1 month ago

DSuveges commented 1 year ago

Open Targets' locus-to-gene (l2g) prediction can tell VEP users whether a queried variant is part of a GWAS locus and whether that locus can be linked to a gene.

The scope of this ticket includes:

Formal requirements:

Starting points:

DSuveges commented 1 year ago

Process:

  1. Identify the input datasets: locus2gene, fine-mapping credible sets, LD-expanded credible sets, target index.
  2. Process the l2g dataset: extract study (study ID), locus (lead variant ID), gene (Ensembl gene ID) triplets with l2g score >= 0.5.
  3. Read the target index and join gene symbol and name to the gene identifier (verify that all gene IDs are mapped).
  4. Get all tag variants from the credible sets for each study/locus pair: from the fine-mapped dataset if available, otherwise from the LD-expanded dataset (a left-anti join was performed to drop all fine-mapped studies from the LD-expanded dataset).
  5. Join the credible sets with the l2g table by study ID and lead variant ID (verify that no study/lead pair is lost).
  6. For each credible set variant, keep all genes with the highest l2g score.
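
The steps above can be sketched in plain Python to make the join logic explicit. This is only an illustration, not the actual Spark code from the gist: the function name `build_export` and the field names `studyId`, `leadVariantId` and `tagVariantId` are assumptions, and the target index is assumed to be a simple dict keyed by gene ID.

```python
from collections import defaultdict

L2G_THRESHOLD = 0.5  # cut-off from step 2


def build_export(l2g_rows, target_index, finemapped, ld_expanded):
    """Return (export rows, unmapped gene ids). All field names are hypothetical."""
    # Step 2: keep study/locus/gene triplets at or above the threshold.
    triplets = [r for r in l2g_rows if r["l2g"] >= L2G_THRESHOLD]

    # Step 3: attach gene symbol and name; record gene ids missing from the index.
    unmapped = {r["geneId"] for r in triplets if r["geneId"] not in target_index}
    annotated = [
        {**r, **target_index[r["geneId"]]}
        for r in triplets
        if r["geneId"] in target_index
    ]

    # Step 4: prefer fine-mapped credible sets. The anti join drops every
    # LD-expanded row whose study already has fine-mapping results.
    finemapped_studies = {r["studyId"] for r in finemapped}
    credible = list(finemapped) + [
        r for r in ld_expanded if r["studyId"] not in finemapped_studies
    ]

    # Step 5: join credible sets with l2g rows on (study id, lead variant id).
    genes_by_locus = defaultdict(list)
    for r in annotated:
        genes_by_locus[(r["studyId"], r["leadVariantId"])].append(r)

    genes_by_variant = defaultdict(list)
    for c in credible:
        locus = (c["studyId"], c["leadVariantId"])
        genes_by_variant[c["tagVariantId"]].extend(genes_by_locus.get(locus, []))

    # Step 6: per tag variant, keep only the gene(s) with the highest score.
    export = []
    for variant_id, genes in genes_by_variant.items():
        if not genes:
            continue
        best = max(g["l2g"] for g in genes)
        seen = set()  # avoid repeating a gene reached through several studies
        for g in genes:
            if g["l2g"] == best and g["geneId"] not in seen:
                seen.add(g["geneId"])
                export.append({
                    "variantId": variant_id,
                    "geneId": g["geneId"],
                    "geneSymbol": g["geneSymbol"],
                    "geneName": g["geneName"],
                    "l2g": g["l2g"],
                })
    return export, unmapped
```

Returning the unmapped gene IDs alongside the rows is one way to cover the verification in step 3; a similar check could be added for the study/lead pairs in step 5.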

Gist prototyping the exported dataset.

Output 1:

Completely exploded table: each variant is repeated once for every gene it is associated with.

Schema:

root
 |-- chromosome: string (nullable = true)
 |-- position: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- geneId: string (nullable = true)
 |-- geneSymbol: string (nullable = true)
 |-- geneName: string (nullable = true)
 |-- l2g: double (nullable = true)

Example:

-RECORD 0------------------------------
 chromosome      | 1                   
 position        | 196717788           
 referenceAllele | G                   
 alternateAllele | A                   
 geneId          | ENSG00000000971     
 geneSymbol      | CFH                 
 geneName        | complement factor H 
 l2g             | 0.5803238749504089  
only showing top 1 row
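
As a rough illustration of what "exploded" means here, the sketch below turns one variant and its list of genes into one flat row per gene. The helper name `explode_variant` is hypothetical; only the column names come from the schema above.

```python
def explode_variant(chromosome, position, reference_allele, alternate_allele, genes):
    """Return one flat row per associated gene (hypothetical helper)."""
    return [
        {
            "chromosome": chromosome,
            "position": position,
            "referenceAllele": reference_allele,
            "alternateAllele": alternate_allele,
            "geneId": gene["geneId"],
            "geneSymbol": gene["geneSymbol"],
            "geneName": gene["geneName"],
            "l2g": gene["l2g"],
        }
        for gene in genes
    ]
```

A variant linked to two genes would therefore appear in two rows that share the same chromosome, position and alleles.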

Output 2:

Data aggregated by variant: each variant appears only once in the table, with a list of associated gene objects that contain the gene data.

Schema:

root
 |-- chromosome: string (nullable = true)
 |-- position: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- associatedGenes: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- geneId: string (nullable = true)
 |    |    |-- geneSymbol: string (nullable = true)
 |    |    |-- geneName: string (nullable = true)
 |    |    |-- l2g: double (nullable = true)

Example:

-RECORD 0------------------------------------------------------------------------------------
 chromosome      | 1                                                                         
 position        | 1222304                                                                   
 referenceAllele | C                                                                         
 alternateAllele | T                                                                         
 associatedGenes | [{ENSG00000184163, C1QTNF12, C1q and TNF related 12, 0.5497661828994751}] 
only showing top 1 row
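
The relationship between the two outputs can be sketched as a group-by over the exploded rows. This is a minimal plain-Python illustration, not the Spark aggregation used in the prototype; the function name `aggregate_by_variant` is an assumption, while the column names follow the schemas above.

```python
from collections import defaultdict

VARIANT_KEYS = ("chromosome", "position", "referenceAllele", "alternateAllele")
GENE_KEYS = ("geneId", "geneSymbol", "geneName", "l2g")


def aggregate_by_variant(exploded_rows):
    """Collapse Output 1 style rows into one row per variant (Output 2 style)."""
    grouped = defaultdict(list)
    for row in exploded_rows:
        variant = tuple(row[k] for k in VARIANT_KEYS)
        grouped[variant].append({k: row[k] for k in GENE_KEYS})
    # Rebuild one record per variant with its collected gene objects.
    return [
        {**dict(zip(VARIANT_KEYS, variant)), "associatedGenes": genes}
        for variant, genes in grouped.items()
    ]
```

Because every input row contributes exactly one gene object, `associatedGenes` is never empty, which is consistent with the `nullable = false` array in the schema.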

Stats

DSuveges commented 1 year ago

Feedback from the Ensembl team on what we should change:

d0choa commented 1 month ago

do we close this @DSuveges?

DSuveges commented 1 month ago

Yes. I think we can consider the schema finished. We'll open a ticket to set up the new process.