opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Enrich locus to gene prediction dataset with features #3608

Open DSuveges opened 2 weeks ago

DSuveges commented 2 weeks ago

The locus-to-gene (L2G) prediction dataset is loaded into the Platform, where it populates the credible set widget and the locus-to-gene table. At this point the table contains only the study locus ID, the gene ID, and the locus-to-gene score. It needs to be enriched with the feature matrix. The new column is expected to be a map type, so that the list of features can grow without requiring a schema change.

Schema:

root
 |-- studyLocusId: string (nullable = true)
 |-- geneId: string (nullable = true)
 |-- locusToGeneFeatures: map (nullable = false)
 |    |-- key: string
 |    |-- value: float (valueContainsNull = true)

Example:

-RECORD 0------------------------------------------------------
 studyLocusId        | bdb5275d1556f592edb5de40c6a35aa7
 geneId              | ENSG00000103994
 locusToGeneFeatures | {credibleSetConfidence -> 0.75, distanceFootprintMean -> 0.6429367, distanceFootprintMeanNeighbourhood -> -0.048183195, distanceSentinelFootprint -> 0.6429367, distanceSentinelFootprintNeighbourhood -> -0.048183195, distanceSentinelTss -> 0.6429367, distanceSentinelTssNeighbourhood -> -0.04694459, distanceTssMean -> 0.6429367, distanceTssMeanNeighbourhood -> -0.04694459, eQtlColocClppMaximum -> null, eQtlColocClppMaximumNeighbourhood -> null, eQtlColocH4Maximum -> null, eQtlColocH4MaximumNeighbourhood -> null, geneCount500kb -> null, isProteinCoding -> 1.0, pQtlColocClppMaximum -> null, pQtlColocClppMaximumNeighbourhood -> null, pQtlColocH4Maximum -> null, pQtlColocH4MaximumNeighbourhood -> null, proteinGeneCount500kb -> null, sQtlColocClppMaximum -> null, sQtlColocClppMaximumNeighbourhood -> null, sQtlColocH4Maximum -> null, sQtlColocH4MaximumNeighbourhood -> null, vepMaximum -> 0.0, vepMaximumNeighbourhood -> -0.0055555557, vepMean -> 0.0, vepMeanNeighbourhood -> -0.0055555557}
only showing top 1 row

The column needs to be added at the L2G prediction step.
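
To illustrate why a map column avoids schema changes, here is a minimal sketch in plain Python (not the Spark code used in the pipeline). It assumes a wide feature row with one column per feature; the helper `pack_features` and the constant `KEY_COLUMNS` are hypothetical names introduced only for this example.

```python
from typing import Dict, Optional

# Assumption: these two columns identify a row; every other column is a feature.
KEY_COLUMNS = ("studyLocusId", "geneId")


def pack_features(wide_row: Dict[str, object]) -> Dict[str, object]:
    """Collapse all non-key columns of a wide row into a single feature map."""
    features: Dict[str, Optional[float]] = {
        name: (None if value is None else float(value))
        for name, value in wide_row.items()
        if name not in KEY_COLUMNS
    }
    packed: Dict[str, object] = {key: wide_row[key] for key in KEY_COLUMNS}
    packed["locusToGeneFeatures"] = features
    return packed


# A few values taken from the example record above.
wide_row = {
    "studyLocusId": "bdb5275d1556f592edb5de40c6a35aa7",
    "geneId": "ENSG00000103994",
    "credibleSetConfidence": 0.75,
    "isProteinCoding": 1,
    "eQtlColocH4Maximum": None,
}
packed = pack_features(wide_row)
```

Adding another feature column to `wide_row` only adds a key inside `locusToGeneFeatures`; the top-level columns of the packed row stay the same.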

DSuveges commented 2 weeks ago

The code to generate the new dataset is merged and is now part of the L2G prediction step. The dataset:

gs://ot_orchestration/releases/24.10_freeze6/locus_to_gene_predictions_w_features

ireneisdoomed commented 2 weeks ago

The annotation went fine. The scores are identical between the annotated and non-annotated datasets, and the number of credible sets is the same. There are no rows with null features. In terms of size, the dataset has roughly doubled, from 600 MB to 1.3 GB.
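
The three checks above could be expressed roughly as follows. This is a plain-Python sketch on made-up toy rows, not the code that was actually run. The column name `score`, the helper names, and the locus and gene identifiers are assumptions for illustration, and "no null features" is read here as "the feature map itself is never null" (individual feature values may still be null, as in the example record).

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def scores_by_pair(rows: List[Row]) -> Dict[Tuple[object, object], object]:
    """Index the (hypothetical) `score` column by (studyLocusId, geneId)."""
    return {(row["studyLocusId"], row["geneId"]): row["score"] for row in rows}


def check_annotation(before: List[Row], after: List[Row]) -> Dict[str, bool]:
    """Compare the non-annotated (`before`) and annotated (`after`) datasets."""
    loci_before = {row["studyLocusId"] for row in before}
    loci_after = {row["studyLocusId"] for row in after}
    return {
        "same_scores": scores_by_pair(before) == scores_by_pair(after),
        "same_credible_set_count": len(loci_before) == len(loci_after),
        "no_null_feature_maps": all(
            row.get("locusToGeneFeatures") is not None for row in after
        ),
    }


# Toy data, invented for this sketch only.
before = [
    {"studyLocusId": "locus_a", "geneId": "gene_1", "score": 0.9},
    {"studyLocusId": "locus_a", "geneId": "gene_2", "score": 0.1},
]
after = [
    {**row, "locusToGeneFeatures": {"isProteinCoding": 1.0, "vepMean": None}}
    for row in before
]
report = check_annotation(before, after)
```

All three entries of `report` are true only when the annotation step left the scores and credible sets untouched and attached a feature map to every row.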