opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Investigate scores available in associations data files versus API #1508

Closed andrewhercules closed 3 years ago

andrewhercules commented 3 years ago

A user has reported that the scores in the associationByOverallDirect JSON file do not match the scores available in the API and presented on the associations page.

For example, the overall association score returned by the API for BRAF and Noonan syndrome is 0.85, but the overallDatasourceHarmonicScore in the JSON file is 0.9781107755829519.

part-00186-ecc3d41f-c4e5-42c5-a5a4-b4de41f749d4-c000.json:

{"diseaseId":"Orphanet_648","targetId":"ENSG00000157764","diseaseLabel":"Noonan syndrome","targetName":"B-Raf proto-oncogene, serine/threonine kinase","targetSymbol":"BRAF","overallDatasourceHarmonicScore":0.9781107755829519,"overallDatatypeHarmonicScore":0.9960553490155523,"overallDatasourceHarmonicVector":[{"datasourceId":"eva","datasourceHarmonicScore":1.0134063342019686,"datasourceEvidenceCount":100,"weight":1.0},{"datasourceId":"europepmc","datasourceHarmonicScore":0.08257571794867775,"datasourceEvidenceCount":12,"weight":0.2},{"datasourceId":"uniprot_variants","datasourceHarmonicScore":1.0,"datasourceEvidenceCount":3,"weight":1.0},{"datasourceId":"clingen","datasourceHarmonicScore":0.5,"datasourceEvidenceCount":1,"weight":1.0},{"datasourceId":"genomics_england","datasourceHarmonicScore":0.9909172408129521,"datasourceEvidenceCount":9,"weight":1.0},{"datasourceId":"reactome","datasourceHarmonicScore":1.0,"datasourceEvidenceCount":7,"weight":1.0}],"overallDatatypeHarmonicVector":[{"datatypeId":"genetic_association","datatypeHarmonicScore":1.0150574935620538,"datatypeEvidenceCount":113,"weight":1.0},{"datatypeId":"affected_pathway","datatypeHarmonicScore":1.0,"datatypeEvidenceCount":7,"weight":1.0},{"datatypeId":"literature","datatypeHarmonicScore":0.01651514358973555,"datatypeEvidenceCount":12,"weight":1.0}],"overallDatasourceEvidenceCount":132.0,"overallDatatypeEvidenceCount":132.0}
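To make the comparison reproducible, here is a minimal Python sketch that looks up a target/disease pair in a part file and compares its score with the value reported from the API. It assumes the part files hold one JSON object per line. `find_association` is a hypothetical helper, and `RECORD_LINE` is an abbreviated copy of the record above rather than the full line.

```python
import json

# Abbreviated copy of the record above: only the fields used here are kept.
RECORD_LINE = (
    '{"diseaseId":"Orphanet_648","targetId":"ENSG00000157764",'
    '"diseaseLabel":"Noonan syndrome","targetSymbol":"BRAF",'
    '"overallDatasourceHarmonicScore":0.9781107755829519,'
    '"overallDatatypeHarmonicScore":0.9960553490155523}'
)

# Overall association score reported from the API in this issue.
API_SCORE = 0.85


def find_association(lines, target_id, disease_id):
    """Return the first record matching the target/disease pair, or None.

    Hypothetical helper; assumes each non-empty line is one JSON object.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["targetId"] == target_id and record["diseaseId"] == disease_id:
            return record
    return None


record = find_association([RECORD_LINE], "ENSG00000157764", "Orphanet_648")
file_score = record["overallDatasourceHarmonicScore"]
difference = file_score - API_SCORE
print(f"file={file_score:.4f} api={API_SCORE:.2f} difference={difference:.4f}")
```

To run it against a downloaded part file, pass an open file handle as `lines` instead of the one-element list.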

Can we please investigate the difference in the score returned by the API and the score available in the data downloads file?

In the meantime, I will respond to the user that we are investigating the discrepancy.

sigven commented 3 years ago

Any progress on understanding the cause of this bug?

andrewhercules commented 3 years ago

Hi @sigven!

Our data and technical teams have investigated the issue. Both the data shown in the user interface and the associations datasets available for download are correct and valid. The difference between them arises because they are produced with slightly different algorithms, which use different normalisation and harmonic sum strategies. We expect the rankings in the user interface and in the datasets to be broadly similar, but there will be some differences due to the different algorithms.

We will be harmonising our approach in our next release, 21.06, scheduled for the end of June. Both the user interface and the datasets will then provide the same data.
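As an illustration of how the normalisation choice alone can change the overall score, here is a small Python sketch of a harmonic sum applied to the six datasourceHarmonicScore values from the BRAF / Noonan syndrome record quoted in this issue. It assumes a plain harmonic sum (scores sorted in descending order, the i-th score divided by i squared) and is not taken from the pipeline code. Dividing by the maximum attainable for six scores happens to reproduce the file's overallDatasourceHarmonicScore for this one record. The second normalisation, dividing by π²/6, is purely an illustrative alternative and is not claimed to be what the API used. All function names are hypothetical.

```python
import math

# datasourceHarmonicScore values copied from the record quoted in this issue.
DATASOURCE_SCORES = {
    "eva": 1.0134063342019686,
    "europepmc": 0.08257571794867775,
    "uniprot_variants": 1.0,
    "clingen": 0.5,
    "genomics_england": 0.9909172408129521,
    "reactome": 1.0,
}

# overallDatasourceHarmonicScore from the same record.
FILE_VALUE = 0.9781107755829519


def harmonic_sum(scores):
    """Sum of scores sorted in descending order, the i-th divided by i**2."""
    ordered = sorted(scores, reverse=True)
    return sum(score / i**2 for i, score in enumerate(ordered, start=1))


def normalise_by_finite_max(scores):
    """Divide by the harmonic sum of the same number of scores, all equal to 1."""
    scores = list(scores)
    return harmonic_sum(scores) / harmonic_sum([1.0] * len(scores))


def normalise_by_series_limit(scores):
    """Illustrative alternative: divide by pi**2 / 6, the limit of sum(1 / i**2)."""
    return harmonic_sum(scores) / (math.pi**2 / 6)


finite = normalise_by_finite_max(DATASOURCE_SCORES.values())
limit = normalise_by_series_limit(DATASOURCE_SCORES.values())
print(f"finite-max normalisation:   {finite:.6f} (file value {FILE_VALUE:.6f})")
print(f"series-limit normalisation: {limit:.6f}")
```

The same raw harmonic sum gives noticeably different overall scores under the two normalisations, which is the kind of effect described above. This check covers the datasource score only and says nothing about how the datatype score is derived.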

sigven commented 3 years ago

Great @andrewhercules! Thanks for looking into this, highly appreciated. I will use the 21.04 data meanwhile, looking forward to the 21.06 release.

regards, Sigve

andrewhercules commented 3 years ago

Ticket closed: the bug has been resolved, and new associations files have been generated and made available via FTP and BigQuery.