Closed DSuveges closed 1 month ago
Process:
Gist prototyping the exported dataset.
Fully exploded table: each variant is repeated once for every gene it is associated with.
Schema:
root
|-- chromosome: string (nullable = true)
|-- position: long (nullable = true)
|-- referenceAllele: string (nullable = true)
|-- alternateAllele: string (nullable = true)
|-- geneId: string (nullable = true)
|-- geneSymbol: string (nullable = true)
|-- geneName: string (nullable = true)
|-- l2g: double (nullable = true)
Example:
-RECORD 0------------------------------
chromosome | 1
position | 196717788
referenceAllele | G
alternateAllele | A
geneId | ENSG00000000971
geneSymbol | CFH
geneName | complement factor H
l2g | 0.5803238749504089
only showing top 1 row
Data aggregated by variant: each variant in the table is unique, and a list of associated gene objects containing the gene data is added.
Schema:
root
|-- chromosome: string (nullable = true)
|-- position: long (nullable = true)
|-- referenceAllele: string (nullable = true)
|-- alternateAllele: string (nullable = true)
|-- associatedGenes: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- geneId: string (nullable = true)
| | |-- geneSymbol: string (nullable = true)
| | |-- geneName: string (nullable = true)
| | |-- l2g: double (nullable = true)
Example:
-RECORD 0------------------------------------------------------------------------------------
chromosome | 1
position | 1222304
referenceAllele | C
alternateAllele | T
associatedGenes | [{ENSG00000184163, C1QTNF12, C1q and TNF related 12, 0.5497661828994751}]
only showing top 1 row
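The relationship between the two layouts can be sketched in plain Python. This is only an illustration of the grouping step, not the Spark code used for the export: the function name `aggregate_by_variant` is made up here, and the second CFH-variant row uses placeholder gene values (marked in the comments) so that one variant has more than one gene.

```python
from collections import defaultdict

VARIANT_KEYS = ("chromosome", "position", "referenceAllele", "alternateAllele")
GENE_KEYS = ("geneId", "geneSymbol", "geneName", "l2g")

# Exploded layout: one row per (variant, gene) pair.
exploded_rows = [
    {"chromosome": "1", "position": 196717788, "referenceAllele": "G",
     "alternateAllele": "A", "geneId": "ENSG00000000971", "geneSymbol": "CFH",
     "geneName": "complement factor H", "l2g": 0.5803238749504089},
    # Hypothetical row: placeholder gene values, not real data.
    {"chromosome": "1", "position": 196717788, "referenceAllele": "G",
     "alternateAllele": "A", "geneId": "ENSG_PLACEHOLDER", "geneSymbol": "GENE_X",
     "geneName": "placeholder gene", "l2g": 0.1},
    {"chromosome": "1", "position": 1222304, "referenceAllele": "C",
     "alternateAllele": "T", "geneId": "ENSG00000184163", "geneSymbol": "C1QTNF12",
     "geneName": "C1q and TNF related 12", "l2g": 0.5497661828994751},
]


def aggregate_by_variant(rows):
    """Collapse exploded rows into one record per variant with an associatedGenes list."""
    grouped = defaultdict(list)
    for row in rows:
        variant = tuple(row[key] for key in VARIANT_KEYS)
        grouped[variant].append({key: row[key] for key in GENE_KEYS})
    # Rebuild one dict per variant; dicts keep insertion order, so variants
    # appear in the order they were first seen.
    return [
        dict(zip(VARIANT_KEYS, variant), associatedGenes=genes)
        for variant, genes in grouped.items()
    ]


aggregated = aggregate_by_variant(exploded_rows)
```

Each element of `aggregated` carries the four variant columns plus an `associatedGenes` list of gene structs, mirroring the second schema.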
Feedback from the Ensembl team: we should use the .gz extension instead of .bz (renaming is enough).
tabix -f -b2 -e2 -c chromosome OTGenetics.tsv.gz
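As a rough sketch of what the file behind that tabix command might look like: tabix needs the rows sorted by sequence name and position, and `-b2 -e2` points it at the second column for coordinates, so the example below assumes chromosome in column 1 and position in column 2, with a header line that starts with `chromosome`. The helper name `to_sorted_tsv` and the column order past the first two columns are assumptions, not part of the actual export.

```python
import csv
import io

# Assumed column order: chromosome first, position second (to match -b2 -e2).
COLUMNS = ["chromosome", "position", "referenceAllele", "alternateAllele",
           "geneId", "geneSymbol", "geneName", "l2g"]

rows = [
    {"chromosome": "1", "position": 196717788, "referenceAllele": "G",
     "alternateAllele": "A", "geneId": "ENSG00000000971", "geneSymbol": "CFH",
     "geneName": "complement factor H", "l2g": 0.5803238749504089},
    {"chromosome": "1", "position": 1222304, "referenceAllele": "C",
     "alternateAllele": "T", "geneId": "ENSG00000184163", "geneSymbol": "C1QTNF12",
     "geneName": "C1q and TNF related 12", "l2g": 0.5497661828994751},
]


def to_sorted_tsv(records):
    """Render records as tab-separated text, sorted by chromosome then position."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    # Sorting keeps each chromosome contiguous and positions ascending within it.
    for record in sorted(records, key=lambda r: (r["chromosome"], r["position"])):
        writer.writerow(record)
    return buffer.getvalue()


tsv_text = to_sorted_tsv(rows)
```

Compression and indexing are left to the command above; this only covers the layout and ordering of the text.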
do we close this @DSuveges?
Yes. I think we can consider the schema finished. We'll open a ticket to set up the new process.
Open Targets' locus-to-gene (l2g) prediction can inform VEP users whether a queried variant is part of a GWAS locus and whether that locus can be linked to a gene.
The scope of this ticket includes:
Formal requirements:
Starting points: