monarch-initiative / monarch-legacy

Monarch web application and API
BSD 3-Clause "New" or "Revised" License

Disambiguate counts on data tabs/tables #1263

Open kshefchek opened 8 years ago

kshefchek commented 8 years ago

Any given data table has three types of counts:

  1. Distinct associations between X (and subclasses) and Y
  2. Distinct Y
  3. Distinct classes in closure

For example, on a phenotype page: Abnormality of the central nervous system, on the gene table we have:

  1. Distinct associations: 51423
  2. Distinct genes: 7537
  3. Distinct phenotypes in closure: not shown

Right now, count 2 appears in the tab and count 1 appears in the table view. Without documentation, this is confusing. How can we better display these counts? cc @jmcmurry
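To make the three counts concrete, here is a minimal sketch over toy data. The identifiers, the `CLOSURE` mapping and the `table_counts` helper are all hypothetical; this is not how the Monarch application computes its numbers, and "classes in closure" is read here as the distinct phenotype classes (the page term or its subclasses) that actually carry an association.

```python
from typing import NamedTuple


class Association(NamedTuple):
    subject: str  # hypothetical gene ID
    object: str   # phenotype term the gene is directly annotated to


# Hypothetical closure: each phenotype maps to itself plus all its ancestors.
CLOSURE = {
    "P:A": {"P:A"},
    "P:B": {"P:B", "P:A"},
    "P:C": {"P:C", "P:A"},
}

ASSOCIATIONS = [
    Association("GENE:1", "P:B"),
    Association("GENE:1", "P:C"),
    Association("GENE:2", "P:B"),
    Association("GENE:3", "P:A"),
]


def table_counts(page_term, associations, closure):
    """Return the three counts for the gene table on a phenotype page."""
    # An association matches when the page term is in the closure of its object.
    hits = {a for a in associations if page_term in closure[a.object]}
    return {
        "associations": len(hits),                               # count 1
        "genes": len({a.subject for a in hits}),                 # count 2
        "phenotypes_in_closure": len({a.object for a in hits}),  # count 3
    }


counts = table_counts("P:A", ASSOCIATIONS, CLOSURE)
print(counts)
```

With this toy data the page for `P:A` has more associations than genes, which is the same kind of gap as 51423 versus 7537 above.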


Issue reported by @cindyJax @sbello, cc @mellybelly

jmcmurry commented 8 years ago

Thanks for bringing this up again. In addition to formally documenting, I wonder if it is better to just disambiguate the number of distinct relationships, versus the number of distinct subjects and objects. For instance, like so:

[Screenshot: 2016-06-07, 11:01 am]

Here the (i) icon would provide the formal documentation, explaining closure if necessary.

I don't feel strongly about making the inclusion of subtypes configurable. It seems silly not to want it. We could alternatively just say "including subclasses".

Thoughts welcome.

cmungall commented 8 years ago

We should functionally distinguish between 'entities' and 'terms'. Even though these may both be modeled as ontology classes, there is an expected difference in behavior. Entities typically form a disjoint set without any primary classification axis. Terms form a subsumption lattice or similar. So it's meaningful to count entities (e.g. number of genes in the "abnormality of CNS" gene set). It's difficult to meaningfully count terms (e.g. number of phenotypes for "SHH") due to redundancy.
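One way to read the redundancy point is that the "number of phenotypes" for a gene depends on whether annotations to ancestor terms are counted. The sketch below uses a made-up is-a hierarchy and made-up annotations; the `PARENTS` mapping and both helper functions are hypothetical and only illustrate why a single term count is hard to interpret.

```python
# Hypothetical is-a edges: term -> set of direct parents.
PARENTS = {
    "T:cns": set(),
    "T:forebrain": {"T:cns"},
    "T:holoprosencephaly": {"T:forebrain"},
    "T:polydactyly": set(),
}


def ancestors(term, parents):
    """All proper ancestors of a term, found by walking the parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def most_specific(terms, parents):
    """Drop every term that is an ancestor of another term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return terms - implied


# Terms a hypothetical gene is directly annotated to.
annotated = {"T:holoprosencephaly", "T:forebrain", "T:polydactyly"}

asserted_count = len(annotated)
specific_count = len(most_specific(annotated, PARENTS))
inferred_count = len(annotated | {a for t in annotated for a in ancestors(t, PARENTS)})
print(asserted_count, specific_count, inferred_count)
```

The same gene yields three different term counts (asserted, most specific, inferred), whereas a count of disjoint entities such as genes has only one sensible value.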

This somewhat breaks down for entities like genotypes, which subsume in their partonomy, but it is still useful. It also breaks down with diseases: restricting to OMIM there is (more or less) a disjoint set, but with a hierarchy, where we have associations from more generic disease classes, questions of the form "how many diseases have ..." become a bit more problematic.

Every tab should be conceived of as a relation, with the page being either subject or object. Reasoning should always be used. So for a phenotype page P, the query is "has-phenotype some P". In this case the reasoning is trivial, so there is no immediate need for an explanation. In other cases we will need to be more explicit about the reasoning (note that the reasoning task is distributed, with some taking place during the query that populates golr, see https://github.com/monarch-initiative/dipper/issues/324, and some taking place using the closure indices). For complete explanations, a graph view is probably best.

Broadly, there are two separate responses to the "R some X" query. One is a set of things that satisfy the query, the other is the set of things plus the immediate assertions about those things that lead to the query being satisfied. For example, for the genes tab on the phenotype page P, the set of entities are the genes, and the set of assertions are the set of associations to some subclass of P. Procedurally it can be easiest to think in terms of the closure fields but thinking in terms of reasoning and explanations is more powerful.
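The two responses can be sketched as a single function that returns both the satisfying entities and, for each entity, the assertions that explain why it matched. Everything here (the `has_phenotype_some` name, the `SUBCLASSES` mapping, the IDs) is a hypothetical illustration over in-memory data, not the golr query itself.

```python
# Hypothetical subclass closure for the page term.
SUBCLASSES = {"P:cns": {"P:forebrain", "P:seizures"}}

# Hypothetical asserted (gene, phenotype) associations.
ASSOCIATIONS = [
    ("GENE:1", "P:forebrain"),
    ("GENE:1", "P:seizures"),
    ("GENE:2", "P:cns"),
    ("GENE:3", "P:polydactyly"),
]


def has_phenotype_some(page_term, associations, subclasses):
    """Answer 'has-phenotype some P' with entities plus their explanations."""
    matching_terms = subclasses.get(page_term, set()) | {page_term}
    explanations = {}
    for gene, phenotype in associations:
        if phenotype in matching_terms:
            # Keep the asserted association that makes this gene satisfy the query.
            explanations.setdefault(gene, []).append((gene, phenotype))
    entities = set(explanations)
    return entities, explanations


entities, explanations = has_phenotype_some("P:cns", ASSOCIATIONS, SUBCLASSES)
print(len(entities), sum(len(v) for v in explanations.values()))
```

Keeping the explanations keyed by entity means one structure can back both the entity count and the association count.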

This framework can be used for everything, and we can be creative about how we display this. Kent has some ideas about a graphical display. But for the basic table oriented display, some key points:

We attempt to provide a way to switch between these views in AmiGO.

E.g. by default we show associations (http://tomodachi.berkeleybop.org/amigo/term/GO:0007417), but there is a link to get the entities.

We have some ideas on how to improve this but haven't had the time. We're kind of exposing the solr denormalization a bit too much. Ideally you would not require the user to switch but you would see something that combined both.

kshefchek commented 6 years ago

Adding @qjwang2001 as a watcher

lwinfree commented 6 years ago

Any updates on this? I'm still seeing different numbers displayed on the tabs vs in the data table. See pic (421 phenotypes listed vs 906 in reality):

[Screenshot: 2018-02-15, 2:09 pm]

kshefchek commented 6 years ago

We're abandoning the bbop tables in favor of a new widget @putmantime is working on. We should make sure we address this.

@lwinfree when viewing a disease group, the association count will not always match the number of distinct phenotypes, for example when two disease subclasses are annotated to the same phenotype.
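A small sketch of that case, with made-up disease and phenotype IDs (the `phenotype_counts` helper is hypothetical and not part of the Monarch code):

```python
from collections import Counter

# Hypothetical disease group with two subclasses.
DISEASE_SUBCLASSES = {"D:group": {"D:sub1", "D:sub2"}}

# Hypothetical (disease, phenotype) associations.
ASSOCIATIONS = [
    ("D:sub1", "P:seizures"),
    ("D:sub2", "P:seizures"),  # same phenotype reached via a second subclass
    ("D:sub2", "P:ataxia"),
]


def phenotype_counts(group, associations, subclasses):
    """Return (association count, distinct phenotype count, per-phenotype tally)."""
    members = subclasses.get(group, set()) | {group}
    per_phenotype = Counter(p for d, p in associations if d in members)
    return sum(per_phenotype.values()), len(per_phenotype), per_phenotype


n_associations, n_phenotypes, per_phenotype = phenotype_counts(
    "D:group", ASSOCIATIONS, DISEASE_SUBCLASSES
)
print(n_associations, n_phenotypes)
```

Here the group page would have three associations but only two distinct phenotypes, because `P:seizures` is annotated to both subclasses.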

jmcmurry commented 6 years ago

Proposal for comment:

UI:

TSV Download:

kshefchek commented 6 years ago

When interacting directly with solr this is somewhat challenging. We can't page or leverage faceting abilities when operating on the distinct list. However, proxying through biolink may solve some of this.

jmcmurry commented 6 years ago

Future comments on this should really refer to how things are being represented in alpha, e.g. https://alpha.monarchinitiative.org/disease/MONDO:0016033#gene