Gene-disease lists are sometimes perplexing in multiple ways

cmungall commented 8 years ago

E.g. https://monarchinitiative.org/gene/NCBIGene%3A2263, diseases tab

This is confusing for anyone coming to the system and confusing to me too.

First of all, the reference is the same for all associations ("Crystal structure of fibroblast growth factor receptor ectodomain bound to ligand and heparin"). This is surely irrelevant.

The relationship for all of them is "inferred", with no explanation of what inferred means.

The source for all of them is the tuple (biogrid,omim)

What does this even mean? The user might be led to believe this is inferred from interaction data. I don't think this is the case.

The evidence type is the pair (sequencing assay, imaging) which is kind of bonkers

It should be possible to easily discern relationships that come from OMIM (see http://omim.org/entry/176943). And these should not be marked inferred. Instead we have a melange of OMIM associations, plus associations to MESH and DOID. The ones to MESH and DOID are marked as coming from OMIM, but this is obviously not correct.

It's not clear what needs to be fixed here: dipper, scigraph clique merging, golr export queries, the UI, or all of these. But we need to deobfuscate.

kshefchek commented 8 years ago

This appears to be fixed in the latest data release. Note that both production and beta are running off the production servers while we update scigraph and the solr indexes.

These are still labelled as inferred, though, since we link OMIM genes to diseases through a blank node (an anonymous variant). I'm not sure how to fix this unless we check whether a variant is an anonymous node; see: https://github.com/monarch-initiative/monarch-cypher-queries/blob/master/src/main/cypher/golr-loader/gene-disease.yaml

cmungall commented 8 years ago

I don't think an anon node hack is a good idea here. It would be better to have something based on the edge metadata. The anon node to disease edge should be marked as being trivial somehow?

Let's look at what's happening. The source is giving us an association X R Y. We are remodeling this as X R1 z1 R2 z2 ... Rn Y. But we end up wanting to show the inference to the user as X R Y anyway, so we reinfer, via some slightly opaque cypher.
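To make the round trip concrete, here is a minimal Python sketch of it. All names here (`remodel`, `reinfer`, the `has_variant` relation, the blank-node naming, the `GENE:X` and `DISEASE:Y` identifiers) are assumptions for illustration; this is not the dipper or cypher code.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def remodel(x: str, relation: str, y: str) -> List[Triple]:
    """Expand a source association X R Y into X R1 z R2 Y via a blank node."""
    z = f"_:anon_variant_{x}"  # hypothetical blank-node naming scheme
    return [(x, "has_variant", z), (z, relation, y)]


def reinfer(chain: List[Triple]) -> Triple:
    """Collapse a contiguous chain back into one end-to-end row for display."""
    for (_, _, obj), (subj, _, _) in zip(chain, chain[1:]):
        if obj != subj:
            raise ValueError("chain is not contiguous")
    # The source's own relation R is not carried over: the row only says "inferred".
    return (chain[0][0], "inferred", chain[-1][2])


chain = remodel("GENE:X", "causes", "DISEASE:Y")
row = reinfer(chain)
```

In this sketch the displayed row ends up with the relation "inferred" even though the source asserted X R Y directly, which is the confusion described above.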

Really our remodeled assertion chain is the inference, and what the source says is asserted. Perhaps we are overmodeling here? From a pragmatic non-ontology modeling POV we pull in an association table and then show it in a faceted view.

Alternatively dipper could do both: extract the source assertion and the remodeled assertion chain, pass both onwards.
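A rough sketch of what "do both" could look like, assuming a hypothetical record type that carries the source assertion alongside the remodeled chain. The names `Association`, `ingest`, `display_row`, and `provided_by` are invented for illustration and are not part of dipper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Association:
    asserted: Triple         # what the source says, kept verbatim
    provided_by: str         # the source that made the assertion
    remodeled: List[Triple]  # our own modeling of the same fact


def ingest(subject: str, relation: str, obj: str, provided_by: str) -> Association:
    """Keep the source assertion and build the remodeled chain next to it."""
    anon = f"_:anon_variant_{subject}"  # hypothetical blank-node id
    return Association(
        asserted=(subject, relation, obj),
        provided_by=provided_by,
        remodeled=[(subject, "has_variant", anon), (anon, relation, obj)],
    )


def display_row(assoc: Association) -> Dict[str, str]:
    """Build the faceted-view row from the asserted triple, not the chain."""
    subject, relation, obj = assoc.asserted
    return {"subject": subject, "relation": relation,
            "object": obj, "source": assoc.provided_by}


assoc = ingest("GENE:X", "causes", "DISEASE:Y", "omim")
row = display_row(assoc)
```

With this shape the UI would never need to reinfer X R Y from the chain, and the chain stays available for anyone who wants the detailed model.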

kshefchek commented 8 years ago

I would like to remove direct/inferred from the relationship column, or at least add some documentation of what it means. The value comes from the qualifier property in solr and is moved into the relation column when the relation is empty (as an attempt to contextualize rows with no relation), but this causes confusion.
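Roughly, the fallback behaves like the first function below, and the second shows one possible way to keep the two values apart. This is a sketch only: the `relation` key and both function names are assumptions, not the actual solr schema or UI code.

```python
from typing import Dict, Optional

Doc = Dict[str, Optional[str]]


def relation_cell_current(doc: Doc) -> str:
    """Behaviour as described: the qualifier fills an empty relation cell."""
    return doc.get("relation") or doc.get("qualifier") or ""


def row_cells_separate(doc: Doc) -> Dict[str, str]:
    """Alternative: never mix the two; leave the relation blank if unknown."""
    return {
        "relation": doc.get("relation") or "",
        "qualifier": doc.get("qualifier") or "",
    }


doc = {"relation": None, "qualifier": "inferred"}
current = relation_cell_current(doc)
separate = row_cells_separate(doc)
```

Keeping the qualifier in its own cell would leave the relation column empty for these rows, so it would still need documentation or a better label.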

As for the overmodeling, you can see the various ways we link genes to diseases from OMIM here: https://github.com/kshefchek/dipper/blob/64c77c3c963f7a58cd5ba832befb7c2d989c6edb/scripts/stats/omim-gene-disease.py

The idea is that a gene cannot be directly linked to a disease, but rather is linked by inference through an allele or variant. But the term "inferred" is ambiguous to the user: it could mean that we inferred the link by joining sources, or that it is inferred biologically.

We could easily remove the anonymous variant used to connect genes to diseases without a reference allele or variant if that would be useful.

jmcmurry commented 8 years ago

"Should we just remove direct/inferred from the relationship column? More descriptive qualifiers would be a larger discussion." - Kent

Agreed it is tricky to authoritatively "bin" these; could the chain of evidence be transparent without our fussing over what to call it? Is there any way to do that in the short term? E.g., could we click on the evidence and get a graph or table of the pieces of evidence used?

TomConlin commented 8 years ago

How do we handle attribution here? I see "Source" and two icons. Are they for the data providers at the ends of the inference chain? "Direct" would be attributed to a data source; "inferred" would be attributed to us (or at least to a description of our process, à la IEA).

jmcmurry commented 8 years ago

Even these designations aren't super clear in all cases though. For instance, we have HPO annotations of OMIM and MP/HPO for OMIA. The actual evidence is from OMIM and OMIA, but the manually curated mapping of those free-text entries into structured data is what we need to compute the data. Is there no way to just list the steps?
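As a sketch of what "just list the steps" might mean, assuming each edge in the chain records who provided it: the `Step` type, the `list_steps` function, the identifiers, and the provider labels are all hypothetical.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    subject: str
    relation: str
    obj: str
    provided_by: str  # who asserted or computed this step


def list_steps(steps: List[Step]) -> List[str]:
    """Render each step with its own attribution instead of one binned label."""
    return [
        f"{i}. {s.subject} --{s.relation}--> {s.obj} (from {s.provided_by})"
        for i, s in enumerate(steps, start=1)
    ]


steps = [
    Step("GENE:X", "has_variant", "_:anon_variant", "monarch remodeling"),
    Step("_:anon_variant", "causes", "DISEASE:Y", "omim"),
]
lines = list_steps(steps)
```

Each line carries its own provider, so the table or graph behind an evidence click would not need to call the whole chain either "direct" or "inferred".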

monarch-initiative / monarch-legacy

Gene-disease lists are sometimes perplexing in multiple ways #1251