monarch-initiative / monarch-legacy

Monarch web application and API
BSD 3-Clause "New" or "Revised" License

Remove or differentiate duplicate disease-gene associations #735

Open kshefchek opened 9 years ago

kshefchek commented 9 years ago

Currently there are quite a few duplicate gene phenotype associations, see here: http://beta.monarchinitiative.org/labs/widget-scratch/disease/OMIM:274600

Without investigating, it's likely each result has a different evidence graph, for example a gene-disease association through many variants of a single gene. How should we approach this on the front end? Do we want to have a variant field in the golr results? Or alternatively should this be uniquified for now?

cmungall commented 9 years ago

page can't be found...

kshefchek commented 9 years ago

Rerouted, now http://beta.monarchinitiative.org/disease/OMIM:274600

cmungall commented 9 years ago

Ooh, nice we've made the switch on beta

But yes I see what you mean here. The easiest solution is to fix this at the SG->Golr level, collapse these, and make a more complex evidence graph - seem ok @ccondit and @nlwashington ?

nlwashington commented 9 years ago

yes, absolutely!

nlwashington commented 9 years ago

i think what is happening here is this...

each source of a disease-gene association may use their own identifier system for both disease and genes. this means that we might have:

D - G
D' - G
D - G'
D' - G'

where D == D' and G == G'. So the list of associations may list all four combinations, even though we know that they are equivalent. this becomes a big problem when there are 10 gene-equivalents and 5 disease equivalents...it explodes.
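To make the arithmetic concrete, here is a minimal sketch using made-up placeholder identifiers (`D0`..`D4` and `G0`..`G9` are not real IDs). It assumes the worst case, where every disease equivalent ends up paired with every gene equivalent:

```python
from itertools import product

# Placeholder identifiers only: 5 equivalent disease IDs, 10 equivalent gene IDs.
disease_ids = [f"D{i}" for i in range(5)]
gene_ids = [f"G{i}" for i in range(10)]

# Worst case: every disease equivalent is paired with every gene equivalent,
# so a single real association shows up as one row per combination.
rows = list(product(disease_ids, gene_ids))
print(len(rows))  # 50
```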

we can approach this by processing the graph in two steps:

  1. materializing all equivalence edges (so we don't miss any)
  2. iterating over all clusters of equivalent things, and applying some rules for identifying one of the nodes in the equivalent set as the representative for the group.

for example, for genes, we may choose to use NCBIGene > ENSEMBL > others. or maybe we would want to make the MOD gene id representative (MGI, HGNC, ZFIN, FB, etc), followed by the others. similarly, variants might follow the MOD variant id > dbSNP > other.

then, the one that is the representative of the group would get a special category attached to the node, and this could be used in cypher queries, and be leveraged when building the golr for the associations. then all equivalent ids that are used in the cypher path would go in the evidence graph.
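A rough sketch of the two steps above, in plain Python rather than in the graph itself. It makes several assumptions: equivalences arrive as simple ID pairs, a union-find stands in for materializing the equivalence edges, and `PREFIX_RANK` encodes only the NCBIGene > ENSEMBL > others ordering floated above. All function names and the example IDs are hypothetical. No ranking was proposed for diseases, so unranked prefixes fall back to an arbitrary alphabetical tie-break.

```python
from collections import defaultdict

# Assumed ranking, taken from the "NCBIGene > ENSEMBL > others" suggestion.
PREFIX_RANK = {"NCBIGene": 0, "ENSEMBL": 1}


def find(parent, node):
    """Return the root of node's cluster, compressing the path as we go."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def cluster(equivalence_edges):
    """Step 1: group IDs into clusters, following equivalences transitively."""
    parent = {}
    for a, b in equivalence_edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    clusters = defaultdict(set)
    for node in parent:
        clusters[find(parent, node)].add(node)
    return list(clusters.values())


def representative(nodes):
    """Step 2: pick the best-ranked ID; ties break alphabetically (arbitrary)."""
    def sort_key(curie):
        prefix = curie.split(":", 1)[0]
        return (PREFIX_RANK.get(prefix, len(PREFIX_RANK)), curie)
    return min(nodes, key=sort_key)


def collapse(associations, equivalence_edges):
    """Map each (disease, gene) pair to its representative pair.

    The original pairs are kept per representative pair, standing in for
    the equivalent IDs that would go into the evidence graph.
    """
    rep = {}
    for nodes in cluster(equivalence_edges):
        chosen = representative(nodes)
        for node in nodes:
            rep[node] = chosen
    collapsed = defaultdict(set)
    for disease, gene in associations:
        collapsed[(rep.get(disease, disease), rep.get(gene, gene))].add((disease, gene))
    return dict(collapsed)


# Made-up IDs for illustration only.
edges = [("NCBIGene:1", "ENSEMBL:G1"), ("ENSEMBL:G1", "HGNC:1"), ("OMIM:1", "DOID:1")]
assocs = [("OMIM:1", "NCBIGene:1"), ("DOID:1", "NCBIGene:1"),
          ("OMIM:1", "HGNC:1"), ("DOID:1", "ENSEMBL:G1")]
result = collapse(assocs, edges)
print(len(result))  # 1
```

In the actual pipeline the representative would presumably be marked with a category on the node and the collapsing done in the cypher query, so this only illustrates the selection logic.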

ccondit commented 9 years ago

As another option (perhaps as a temporary solution), we can handle this with Solr's grouping feature: https://wiki.apache.org/solr/FieldCollapsing An actual example: http://geoffrey.crbs.ucsd.edu:8080/solr/golr/select/?q=subject:NCBIGene\:1981%20AND%20object_category:phenotype&group=true&group.field=object&wt=json
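For anyone wiring this into a client, here is a small sketch that rebuilds the example query above from its parts. The base URL and parameter values are copied from that example; the helper name `grouped_query_url` is made up, and the code only constructs the URL without sending a request.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the example query above.
GOLR_SELECT = "http://geoffrey.crbs.ucsd.edu:8080/solr/golr/select/"


def grouped_query_url(subject, object_category, group_field="object"):
    """Build a Solr select URL that collapses results on group_field."""
    # Solr query syntax needs the colon inside a CURIE escaped with a backslash.
    escaped_subject = subject.replace(":", "\\:")
    params = {
        "q": f"subject:{escaped_subject} AND object_category:{object_category}",
        "group": "true",
        "group.field": group_field,
        "wt": "json",
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return GOLR_SELECT + "?" + urlencode(params, quote_via=quote)


url = grouped_query_url("NCBIGene:1981", "phenotype")
print(url)
```

The generated URL percent-encodes the colons and the backslash, so it differs textually from the hand-written one but should decode to the same query.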

@kshefchek - is it possible to use this solution for this release?