opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org

Update the Platform's scoring documentation #2552

Closed by HelenaCornu 1 year ago

HelenaCornu commented 2 years ago

Background

Several users, either through the Open Targets Community or the helpdesk, have asked us to clarify how the data in the Platform is scored.

The Platform documentation explains how scores are obtained for individual data sources, but it does not currently contain enough information for users to reproduce the data type or overall association scores, so it needs to be updated.

Tasks

Update the documentation. This could include:

adamsardar commented 1 year ago

Perhaps related: I was reading your recent paper 'Open Targets Platform: new developments and updates two years on', which makes frequent mention of the documentation link https://docs.targetvalidation.org/getting-started/scoring. However, it does not lead anywhere, just a GitBook account page that I do not have access to. I'm really keen to understand how the overall score is calculated and to get a feeling for whether higher scores are more certain than lower scores.

Does a better link for the documentation exist?

Thank you for all of your hard work compiling the information available through OpenTargets. You have saved me so much time over the last few years!

iandunham commented 1 year ago

The content of that page is now distributed across:

https://platform-docs.opentargets.org/evidence

https://platform-docs.opentargets.org/associations

d0choa commented 1 year ago

@mjfalaguera do you think you could help us with this? The missing things are the bits that you have investigated, like the division by 1.6, etc.

mjfalaguera commented 1 year ago

Association scores by data source are calculated and normalised as follows:

For each disease-target-datasource triplet, all the pieces of evidence available in the Platform and their evidence scores are collected, sorted in descending order, and assigned an incremental value that indicates their position in the sorted list, e.g.:

evidence with score=1.0 -> positional id=1
evidence with score=0.9 -> positional id=2
evidence with score=0.8 -> positional id=3

The harmonic sum score for each disease-target-datasource triplet is then calculated by summing each evidence score divided by the square of its positional id, as follows:

harmonic sum score = 1.0/1^2 + 0.9/2^2 + 0.8/3^2

This harmonic sum score is then normalised by dividing it by the theoretical maximum score we could get from 100,000 pieces of evidence, each with a score of 1.0, supporting that association:

max. theoretical harmonic sum score = 1.0/1^2 + 1.0/2^2 + 1.0/3^2 + ... + 1.0/100000^2

so the resulting normalised harmonic sum score is equal to:

normalised harmonic sum score = harmonic sum score / max. theoretical harmonic sum score = (1.0/1^2 + 0.9/2^2 + 0.8/3^2) / (1.0/1^2 + 1.0/2^2 + 1.0/3^2 + ... + 1.0/100000^2)
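
A minimal Python sketch of the per-datasource calculation above, assuming illustrative function and variable names rather than actual Platform code; note that the normalisation constant converges towards pi^2/6 ≈ 1.645, which is presumably the "division by 1.6" mentioned earlier in the thread:

def harmonic_sum(scores, max_terms=100_000):
    # Sum score_i / i^2 over the scores sorted in descending order,
    # truncated to at most max_terms pieces of evidence.
    ranked = sorted(scores, reverse=True)[:max_terms]
    return sum(score / position ** 2 for position, score in enumerate(ranked, start=1))

# Theoretical maximum: 100,000 pieces of evidence all scoring 1.0
# (numerically close to pi^2 / 6 ~= 1.6449).
MAX_HARMONIC_SUM = harmonic_sum([1.0] * 100_000)

def datasource_association_score(evidence_scores):
    # Normalised harmonic sum for one disease-target-datasource triplet.
    return harmonic_sum(evidence_scores) / MAX_HARMONIC_SUM

# Worked example from the comment above: evidence scores 1.0, 0.9 and 0.8
print(datasource_association_score([1.0, 0.9, 0.8]))  # ~0.80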

Overall association scores are calculated and normalised as follows:

Once we have the normalised harmonic sum score for each target-disease-datasource triplet, we multiply it by the weight of that data source in the overall score. To my knowledge, the weights are:

weights = [['cancer_biomarkers', 0.5],
            ['cancer_gene_census', 1],
            ['chembl', 1],
            ['clingen', 1],
            ['crispr', 1],
            ['encore', 0.5],
            ['europepmc', 0.2],
            ['eva', 1],
            ['eva_somatic', 1],
            ['expression_atlas', 0.2],
            ['gene2phenotype', 1],
            ['gene_burden', 1],
            ['genomics_england', 1],
            ['impc', 0.2],
            ['intogen', 1],
            ['orphanet', 1],
            ['ot_crispr', 0.5],           
            ['ot_crispr_validation', 0.5],           
            ['ot_genetics_portal', 1],
            ['progeny', 0.5],
            ['reactome', 1],
            ['slapenrich', 0.5],
            ['sysbio', 0.5],
            ['uniprot_literature', 1],
            ['uniprot_variants', 1]]

As with the evidence scores, these weighted scores are then collected, sorted in descending order, and assigned an incremental value that indicates their position in the sorted list, e.g.:

chembl evidence with score=0.9 * weight=1.0 = 0.9 -> positional id=1
orphanet evidence with score=0.8 * weight=1.0 = 0.8 -> positional id=2
europepmc evidence with score=1.0 * weight=0.2 = 0.2 -> positional id=3

and the overall harmonic sum score is again calculated by summing each weighted score divided by the square of its positional id, as follows:

overall harmonic sum score = 0.9/1^2 + 0.8/2^2 + 0.2/3^2

This harmonic sum score is then normalised, again, by dividing it by the theoretical maximum score we could get from 100,000 pieces of evidence, each with a score of 1.0, supporting that association:

normalised overall harmonic sum score = overall harmonic sum score / max. theoretical harmonic sum score = (0.9/1^2 + 0.8/2^2 + 0.2/3^2) /  (1.0/1^2 + 1.0/2^2 + 1.0/3^2 + ... + 1.0/100000^2)
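
Continuing the same sketch, with the same caveat that the names are illustrative rather than Platform code, the overall score can be reproduced by weighting each per-datasource score and reusing harmonic_sum() and MAX_HARMONIC_SUM from the previous snippet (only a subset of the weights table is shown):

# Subset of the weights listed above; unlisted sources default to 1.0 here.
WEIGHTS = {"chembl": 1.0, "orphanet": 1.0, "europepmc": 0.2}

def overall_association_score(datasource_scores, weights=WEIGHTS):
    # Multiply each normalised per-datasource score by its weight, then
    # apply the same normalised harmonic sum as for the evidence scores.
    weighted = [score * weights.get(source, 1.0)
                for source, score in datasource_scores.items()]
    return harmonic_sum(weighted) / MAX_HARMONIC_SUM

# Worked example from the comment above: chembl=0.9, orphanet=0.8, europepmc=1.0
print(overall_association_score({"chembl": 0.9, "orphanet": 0.8, "europepmc": 1.0}))  # ~0.68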

Hope this helps clarify things.

HelenaCornu commented 1 year ago

@mjfalaguera Thank you so much for this clear and comprehensive explanation! 🌟

I think that what I meant by

Direct/indirect scoring and exceptions (Expression Atlas)

is that we needed to expand and clarify how we calculate indirect scoring versus direct scoring, because we often get questions about this. There is already an explanation in the documentation, but perhaps it could be improved.

Expression Atlas is an exception because evidence from this data source is excluded from indirect score calculations. We do note this in the documentation:

Note: RNA expression data type evidence is not propagated in the ontology. We made this decision to prevent parent terms from having long lists of associated targets with weak RNA expression association scores.

I don't know if indirect scoring is something you looked into?

mjfalaguera commented 1 year ago

In Open Targets resources, "The EMBL-EBI Experimental Factor Ontology (EFO) is used as scaffold for the disease or phenotype entity." (see ref.). In this ontology, disease/phenotype entities are hierarchically classified and interconnected as parent and child terms. For instance, "melanoma" is a parent term that includes amelanotic melanoma, spindle cell melanoma and cutaneous melanoma, among other child terms.

Based on this hierarchy, association scores are calculated in the Platform as follows:

For a given disease, the association score for a target is calculated by collecting all evidence directly linking the disease with the target and all evidence indirectly linking them, that is, all evidence linking this target with any of the child terms of this disease. The formulas described above are then applied to calculate the indirect score for this disease and this target. By default, the association scores displayed in the web application when listing targets associated with a disease of interest are indirect scores (see ref.).

On the other hand, for a given target, the association score for a disease is calculated by collecting only evidence directly linking the target with the disease. The formulas described above are then applied to calculate the direct score for this target and this disease. By default, the association scores displayed in the web application when listing diseases associated with a target of interest are direct scores (see ref.).
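
To make the direct/indirect distinction concrete, here is a hypothetical sketch in the same style as the snippets above; efo_descendants() and the evidence record layout are assumptions made for illustration rather than an Open Targets API, and the only Platform-specific rule encoded is the exclusion of propagated expression_atlas evidence:

def collect_evidence_scores(disease_id, target_id, evidence, efo_descendants, indirect=True):
    # Direct scoring uses only evidence annotated to disease_id itself.
    # Indirect scoring also includes evidence annotated to any EFO descendant
    # of disease_id, except expression_atlas, which is not propagated.
    descendants = set(efo_descendants(disease_id)) if indirect else set()
    scores = []
    for e in evidence:
        if e["targetId"] != target_id:
            continue
        if e["diseaseId"] == disease_id:
            scores.append(e["score"])
        elif e["diseaseId"] in descendants and e["datasourceId"] != "expression_atlas":
            scores.append(e["score"])
    return scores

# The resulting score list then feeds the harmonic sum calculations sketched earlier.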

HelenaCornu commented 1 year ago

Update merged into the documentation.