opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Consider downweighting evidence from inherited terms in target prioritisation #3457

Open Tobi1kenobi opened 1 week ago

Tobi1kenobi commented 1 week ago

Currently, I believe, evidence for a disease term A, its child terms B and C, and its grandchild terms D and E is weighted equally when calculating the target score for disease A. This means that targets that are highly specific to descendant terms (e.g. B, C, D and E) can sometimes be prioritised for disease term A even if they're only relevant for a narrow subset of individuals with disease A.

Two practical examples:

NOD2 is a Crohn's disease (CD) specific gene with a large amount of evidence supporting it. It is associated with IBD because of its many strong Crohn's disease associations, and IBD = CD + UC (roughly speaking). But NOD2 is not a valid target for ulcerative colitis (UC), and UC accounts for roughly half or more of IBD cases.

Despite being irrelevant for about half of individuals with IBD, it is still prioritised by the target engine as the most relevant IBD gene, well above TNF, which has known pharmaceutical relevance for both UC and CD:

image

Another example might be BRCA2 being prioritised for cancer above TP53, even though, as far as I know, BRCA2 is a mostly breast cancer specific gene while TP53 is a pan-cancer gene:

image

A solution to this, as discussed with @d0choa and @DSuveges, could be downweighting evidence when it is inherited. This could either take the ontology structure into account (evidence from children is downweighted, evidence from grandchildren is downweighted even more, etc.) or simply apply a blanket downweighting to all evidence from descendant terms. The latter may be much simpler to implement.
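To make the two options concrete, here is a minimal sketch, not the Platform's actual scoring code. It assumes a harmonic-sum style aggregation of evidence scores; the `Evidence` class, the `distance` field (number of ontology steps between the query disease and the term the evidence is annotated to), the `decay` value of 0.5 and the toy scores are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    score: float   # evidence score in [0, 1] (toy values, not real data)
    distance: int  # 0 = direct, 1 = child term, 2 = grandchild term, ...


def inheritance_weight(distance: int, decay: float = 0.5, blanket: bool = False) -> float:
    """Weight applied to a piece of evidence based on where it was annotated.

    blanket=False: structure-aware, the weight shrinks with every ontology step.
    blanket=True:  every inherited piece of evidence gets the same penalty.
    """
    if distance == 0:
        return 1.0  # direct evidence is never penalised
    return decay if blanket else decay ** distance


def harmonic_sum(scores: list[float]) -> float:
    """Sum of scores ranked in descending order, each divided by rank squared."""
    ranked = sorted(scores, reverse=True)
    return sum(s / (rank ** 2) for rank, s in enumerate(ranked, start=1))


def association_score(evidence: list[Evidence], decay: float = 0.5, blanket: bool = False) -> float:
    """Aggregate evidence after downweighting the inherited items."""
    weighted = [e.score * inheritance_weight(e.distance, decay, blanket) for e in evidence]
    return harmonic_sum(weighted)


# Toy scenario: one target supported only by child-term evidence,
# another supported by weaker but direct evidence.
specific = [Evidence(0.9, 1), Evidence(0.8, 1), Evidence(0.7, 1)]
broad = [Evidence(0.8, 0), Evidence(0.6, 0)]

# decay=1.0 reproduces the "all evidence is equal" behaviour described above.
unweighted = (association_score(specific, decay=1.0), association_score(broad, decay=1.0))
downweighted = (association_score(specific, decay=0.5), association_score(broad, decay=0.5))
print("no penalty:  ", unweighted)
print("with penalty:", downweighted)
```

In this toy setup the child-specific target outranks the broadly supported one when nothing is penalised, and the order flips once inherited evidence is halved. Whether any particular `decay` value behaves sensibly on real data is exactly what the benchmarking would need to establish.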

This would require benchmarking to ensure the change does not introduce any unexpected behaviour.

Originally posted by @Tobi1kenobi in https://github.com/opentargets/issues/issues/3450#issuecomment-2342056695

d0choa commented 1 week ago

@mjfalaguera (and probably others) has an offline implementation of the platform scoring. It should be relatively easy to produce a new association dataset with the suggested change for benchmarking purposes.

d0choa commented 1 week ago

Also, to add some background here:

This issue has come up a few times. In the past, Ian Dunham commented that they never had the time in the early days to explore this option, but that, with an appropriate benchmark, it could improve the prioritisation.

This could also solve a different problem. We show only 'direct' or 'indirect' associations, depending on which page you are looking at (target or disease), because of precisely the same problem: evidence on descendant terms can dominate the association score. Because all evidence is scored equally, association scores for high-level terms end up much higher than those of any descendant term.

By penalising propagated evidence, we could solve both issues at the same time.

mjfalaguera commented 4 days ago

For the scoring implementation I use, see https://github.com/opentargets/timeseries/blob/main/timeseries.py