opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

ClinVar and UniProt inappropriate term inheritance #3450

Open Tobi1kenobi opened 2 months ago

Tobi1kenobi commented 2 months ago

As discussed with @d0choa and @DSuveges, there is an issue with the target prioritisation engine, at least for ClinVar and UniProt, where some terms are treated as synonymous with their ancestor terms rather than inheriting exclusively from descendant terms. An example of this is ulcerative colitis, whose parent term is inflammatory bowel disease. For most data sources in the engine this is represented correctly, but for ClinVar and UniProt, genes are given high ulcerative colitis scores because of high inflammatory bowel disease scores.

A particularly egregious example is NOD2, for which a substantial amount of genetic evidence and functional follow-up points towards it being a Crohn's disease gene and NOT an ulcerative colitis gene. Yet the prioritisation engine ranks NOD2 as the top gene for UC as a result of erroneous inheritance of IBD terms and the strong influence of both ClinVar and UniProt; see screenshot:

image

This should be resolved by ensuring that only descendant terms are inherited for ClinVar and UniProt, as is the case with other elements of the prioritisation engine.
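A minimal sketch of the intended behaviour, using a toy ontology rather than the real EFO/MONDO graph: evidence recorded against a term should count towards that term and its ancestors, never towards its descendants or siblings. The `PARENTS` map, the score, and the function names are hypothetical and are not taken from the platform's code. Taking the maximum is only a stand-in to keep the example short; the engine's actual aggregation is not reproduced here.

```python
# Toy ontology: child term -> list of parent terms (hypothetical, not real EFO data).
PARENTS = {
    "ulcerative colitis": ["inflammatory bowel disease"],
    "Crohn's disease": ["inflammatory bowel disease"],
    "inflammatory bowel disease": [],
}


def ancestors(term, parents=PARENTS):
    """Return all ancestors of `term`, excluding the term itself."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_evidence, parents=PARENTS):
    """Push each direct score up to the term's ancestors, keeping the maximum."""
    result = {}
    for disease, score in direct_evidence.items():
        for target in {disease} | ancestors(disease, parents):
            result[target] = max(result.get(target, 0.0), score)
    return result


# Illustrative score only: NOD2 has direct evidence for Crohn's disease.
nod2 = propagate({"Crohn's disease": 0.9})
```

With this direction of propagation, `nod2` contains entries for Crohn's disease and inflammatory bowel disease, and none for ulcerative colitis.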

Tobi1kenobi commented 2 months ago

A related but likely trickier issue is that using all descendant terms may not be best suited to target prioritisation. Again using the above example of NOD2 with IBD vs Crohn's disease (CD): NOD2 is a Crohn's disease specific gene with a large amount of evidence supporting it. It is associated with IBD because of many strong Crohn's disease associations, and IBD = CD + UC (roughly). But for UC, NOD2 is not a valid target, and UC accounts for roughly half or more of IBD cases.

Despite being irrelevant for about half of individuals with IBD, NOD2 is still prioritised by the target engine as the most relevant IBD gene, well above TNF, which has known pharmaceutical relevance for both UC and CD:

image

Another example might be BRCA2 being prioritised for cancer above TP53, despite the former being (as far as I know) a mostly breast cancer specific gene and the latter a pan-cancer gene:

image

Again, this is not as easily resolvable but is possibly worth thinking about. When evidence is taken from descendant terms, the scores could be weighted rather than simply calculated afresh, i.e. IBD's NOD2 ClinVar score would be some linear combination of the UC NOD2 and CD NOD2 scores rather than just a mirror of the CD NOD2 score. How this would propagate down through terms is unclear, but the overarching point is that, to me, broad terms should have broad targets up-weighted rather than highlighting narrow targets with lots of evidence.
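To make the weighting idea concrete, here is a rough sketch in which a parent term's score is a weighted average of its children's scores, with the weights standing in for each child's share of cases. All numbers (the 0.5/0.5 split and the per-disease scores) are invented for illustration, and `weighted_parent_score` is a hypothetical helper, not part of the prioritisation engine.

```python
def weighted_parent_score(child_scores, child_weights):
    """Weighted average of child scores; a missing child score counts as 0."""
    total_weight = sum(child_weights.values())
    if total_weight == 0:
        raise ValueError("child weights must not sum to zero")
    weighted_sum = sum(
        weight * child_scores.get(child, 0.0)
        for child, weight in child_weights.items()
    )
    return weighted_sum / total_weight


# Illustrative values only: assume CD and UC each make up half of IBD cases.
ibd_weights = {"CD": 0.5, "UC": 0.5}
nod2_scores = {"CD": 0.9}             # narrow target: strong CD-only evidence
tnf_scores = {"CD": 0.6, "UC": 0.6}   # broad target: moderate evidence for both

nod2_ibd = weighted_parent_score(nod2_scores, ibd_weights)
tnf_ibd = weighted_parent_score(tnf_scores, ibd_weights)
```

Under these made-up numbers the broad target (TNF) ends up above the narrow one (NOD2) for IBD, which is the behaviour argued for here. How to choose the weights, and whether a weighted average is the right combination at all, remains open.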

DSuveges commented 2 months ago

Just confirming that the tool the data team uses to map disease labels to the EFO ontology (OnToMa) maps 'inflammatory bowel disease 1' to the right MONDO term (MONDO_0009960).

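# Assumes `spark`, `ontoma_cache` and `add_efo_mapping` are already defined in the session.
from pyspark.sql import functions as f, types as t

data = [('inflammatory bowel disease 1',)]
df = (
    spark.createDataFrame(data, ['diseaseFromSource'])
    .withColumn('diseaseFromSourceId', f.lit(None).cast(t.StringType()))
)

add_efo_mapping(df, spark, ontoma_cache_dir=ontoma_cache).show(truncate=False)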

Gives:

+----------------------------+-------------------+-------------------------+
|diseaseFromSource           |diseaseFromSourceId|diseaseFromSourceMappedId|
+----------------------------+-------------------+-------------------------+
|inflammatory bowel disease 1|null               |MONDO_0009960            |
+----------------------------+-------------------+-------------------------+

Tobi1kenobi commented 2 months ago

If 'inflammatory bowel disease 1' is the cause of this issue, I would propose not using this MONDO term and/or contacting MONDO to correct it. We can consult IBD genetics experts before making this decision, but I'm fairly confident ulcerative colitis should not be a child of a term with the description "Any inflammatory bowel disease in which the cause of the disease is a mutation in the NOD2 gene." That's incorrect.

Good to know it is potentially just a UC problem rather than a wider one.

d0choa commented 2 months ago

If I understood everything correctly, the problem is not with the ontology.

The problem is that EVA mapped ClinVar evidence containing 'inflammatory bowel disease 1' as its diseaseFromSource to ulcerative colitis instead of the appropriate MONDO term, MONDO_0009960.

It should be a matter of correcting this mapping in that pipeline. @DSuveges, please let them know.

d0choa commented 2 months ago

An even more exciting exercise could be producing OnToMa mappings for every diseaseFromSource in the ClinVar evidence and checking for inconsistencies. That would give us a sense of how much of an exception this incorrect mapping is.
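The check could look roughly like the sketch below, written in plain Python for brevity. It assumes two inputs that are hypothetical here: the mapping currently attached to each ClinVar evidence string, and a fresh OnToMa mapping for the same diseaseFromSource label. The records, the `find_inconsistencies` helper, and the `UC_TERM_ID` / `EXAMPLE_ID` placeholders are invented; only the 'inflammatory bowel disease 1' to MONDO_0009960 pair comes from the OnToMa output earlier in this thread.

```python
def find_inconsistencies(current_mappings, ontoma_mappings):
    """Return labels whose current mapped ID differs from the OnToMa mapping.

    Both arguments map a diseaseFromSource label to a mapped ontology ID.
    Labels without an OnToMa mapping are skipped rather than flagged.
    """
    mismatches = {}
    for label, current_id in current_mappings.items():
        ontoma_id = ontoma_mappings.get(label)
        if ontoma_id is not None and ontoma_id != current_id:
            mismatches[label] = (current_id, ontoma_id)
    return mismatches


# Hypothetical inputs: "UC_TERM_ID" and "EXAMPLE_ID" are placeholders, not real ontology IDs.
current = {
    "inflammatory bowel disease 1": "UC_TERM_ID",
    "example disease": "EXAMPLE_ID",
}
ontoma = {
    "inflammatory bowel disease 1": "MONDO_0009960",
    "example disease": "EXAMPLE_ID",
}

mismatches = find_inconsistencies(current, ontoma)
fraction_inconsistent = len(mismatches) / len(current)
```

On real data the same comparison would more likely be a join between the evidence DataFrame and the OnToMa output, but the logic is the same: count the labels where the two mapped IDs disagree.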