monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

some possibly surprising phenotype terms and some missing terms in beta Monarch data #824

Closed: realmarcin closed this issue 5 years ago

realmarcin commented 5 years ago

We are trying to debug an issue with our Translator modules which rely on Monarch data and ontobio semantic similarity.

Here are some notes from Colleen Xu:

Between Monday (8/26) and Thursday (8/29), there seems to have been a change to the phenotype annotation. Specifically, EFO terms are being counted as the only phenotype annotation for some genes in our queries and leading to odd results.

Queries for genes with shared phenotype annotations (context: Fanconi anemia) are returning associations like…

LINC00471 has a jaccard similarity score of 1 with FANCI. They both have EFO:0004339 (“body height” https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0004339). This is odd because FANCI has many other phenotype annotations in Monarch (that aren’t shared by the LINC gene), so the score shouldn’t be one (https://monarchinitiative.org/gene/HGNC:25568#phenotypes).

ADIPOR1 has a jaccard similarity score of 1 with XRCC2. They both have EFO:0004584 (“mean platelet volume” https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0004584).
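To illustrate why a single shared term produces a score of 1, here is a minimal sketch of plain set-based Jaccard similarity. It is not ontobio's implementation (which may also take ontology ancestors into account), and the HP identifiers are made-up placeholders, not real annotations for these genes.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections of term IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)


# Hypothetical annotation sets, for illustration only.
linc_terms = {"EFO:0004339"}
fanci_truncated = {"EFO:0004339"}  # what the queries appear to return
fanci_fuller = {"EFO:0004339", "HP:placeholder1", "HP:placeholder2"}

score_truncated = jaccard(linc_terms, fanci_truncated)  # identical sets
score_fuller = jaccard(linc_terms, fanci_fuller)        # 1 shared term out of 3
```

With identical one-term sets the score is 1; as soon as the unshared phenotype terms are present, the union grows and the score drops.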

The list of genes with this issue (the only phenotype annotations returned are EFO terms) includes: BRIP1 ERCC4 FANCB FANCC FANCD2 FANCE FANCI FANCL MAD2L2 RFWD3 SLX4 XRCC2

So it looks like something may have changed in Monarch phenotype data for these genes this week? I can't figure out whether this change is sensible or whether these EFO terms shouldn't be considered phenotypes.

Regardless, one solution for us for now may be to simply filter the EFO terms out.
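A minimal sketch of that workaround, assuming the annotations are available as a list of CURIE strings such as `EFO:0004339`. The function name `drop_prefix` and the HP identifier are hypothetical.

```python
def drop_prefix(term_ids, prefix="EFO"):
    """Return the term IDs whose CURIE prefix differs from `prefix`."""
    return [t for t in term_ids if t.split(":", 1)[0] != prefix]


# Hypothetical input, for illustration only.
annotations = ["EFO:0004339", "HP:placeholder1", "EFO:0004584"]
filtered = drop_prefix(annotations)
```

Note that a gene whose only annotations are EFO terms ends up with an empty list after filtering, so downstream similarity code needs to handle empty sets.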

The other corollary is that biolink API seems to be already using the beta Monarch -- I assume there is no way to switch that via parameter settings.

kshefchek commented 5 years ago

discussed via gitter, but to answer here

Monarch had a major data release on August 29th. This included an update that removed many gene to phenotype associations that are inferred from variant/gene -> disease -> phenotype associations. The production database only includes disease to phenotype associations that have an obligate qualifier from the HPO annotations. I've adjusted the beta inferences to also include the "very frequent" qualifier.

The more experimental join can still be done in biolink via multiple queries, e.g.:

Get causal gene to disease associations

https://api.monarchinitiative.org/api/bioentity/gene/HGNC%3A3583/diseases?rows=100&facet=false&unselect_evidence=false&exclude_automatic_assertions=false&fetch_objects=false&use_compact_associations=false&association_type=causal

Get phenotypes associated with each disease from the above, e.g.:

https://api.monarchinitiative.org/api/bioentity/disease/MONDO%3A0010351/phenotypes?rows=100&facet=false&unselect_evidence=false&exclude_automatic_assertions=false&fetch_objects=false&use_compact_associations=false
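The two queries above can be chained in a short script. This is only a sketch: it assumes the endpoints still respond as they did in this thread and that the JSON payload has an `associations` list whose entries carry an `object` with an `id`. The helper names (`build_url`, `object_ids`, `gene_phenotypes_via_diseases`) are hypothetical, and only a subset of the query parameters from the URLs above is sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE = "https://api.monarchinitiative.org/api/bioentity"


def build_url(category, curie, target, **params):
    """Build a bioentity URL like the ones above (CURIE is percent-encoded)."""
    query = {"rows": 100, "fetch_objects": "false",
             "use_compact_associations": "false"}
    query.update(params)
    return f"{BASE}/{category}/{quote(curie, safe='')}/{target}?{urlencode(query)}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (performs a network call)."""
    with urlopen(url) as resp:
        return json.load(resp)


def object_ids(payload):
    """Collect object IDs, assuming an 'associations' list in the payload."""
    return [a["object"]["id"] for a in payload.get("associations", [])]


def gene_phenotypes_via_diseases(gene_id, fetch=fetch_json):
    """Step 1: causal gene -> diseases. Step 2: each disease -> phenotypes."""
    diseases = object_ids(
        fetch(build_url("gene", gene_id, "diseases", association_type="causal")))
    return {d: object_ids(fetch(build_url("disease", d, "phenotypes")))
            for d in diseases}
```

Calling `gene_phenotypes_via_diseases("HGNC:3583")` would issue one request for the gene and one per returned disease; `fetch` is a parameter so the join logic can be exercised with a stub instead of the live API.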

cmungall commented 5 years ago

Do you have any more context for the decision here?

I think restricting the propagation to causal genes makes sense

However, I don't get the reason to restrict the propagation to Obligates.

UPDATE: this may be less of a problem than I thought (see below); I had thought we had lost all annotations for key genes. However, it would still be good to know the justification and have our rules clearly documented. Why are we doing something different from the HPO site? https://hpo.jax.org/app/browse/gene/55215

cmungall commented 5 years ago

@realmarcin, I am not sure I follow. You say:

The list of genes with this issue (the only phenotype annotations returned are EFO terms) includes: ... FANCI ...

However, I get many annotations when I query the monarch API for FANCI:

https://api.monarchinitiative.org/api/bioentity/gene/HGNC%3A3583/phenotypes?rows=100&facet=false&unselect_evidence=false&exclude_automatic_assertions=false&fetch_objects=false&use_compact_associations=false&association_type=causal

I also see that we have phenotypes for FANCI here:

https://beta.monarchinitiative.org/gene/HGNC:25568#phenotype

Therefore there must be something wrong with the workflow, as the API is giving phenotypes.

realmarcin commented 5 years ago

FANCI was not our example, so it looks like this issue is not affecting all genes.

Here is EPM2A as an example of the difference between beta and the previous release, 5 vs 46 phenotypes:

https://beta.monarchinitiative.org/gene/HGNC:3413#phenotype
https://monarchinitiative.org/gene/HGNC:3413#phenotypes

kshefchek commented 5 years ago

@realmarcin this should be fixed by tomorrow morning. Apologies all -- it was requested that we remove variant to phenotype inferences. Thinking in terms of graph distance, variants are closer to phenotypes than genes are, so I incorrectly thought this meant we should tighten the gene to phenotype inferences as well. @TomConlin and @julesjacobsen pointed out why this is flawed and it all makes sense now.