Open kshefchek opened 9 years ago
page can't be found...
Rerouted, now http://beta.monarchinitiative.org/disease/OMIM:274600
Ooh, nice we've made the switch on beta
But yes I see what you mean here. The easiest solution is to fix this at the SG->Golr level, collapse these, and make a more complex evidence graph - seem ok @ccondit and @nlwashington ?
yes, absolutely!
i think what is happening here is this...
each source of a disease-gene association may use their own identifier system for both disease and genes. this means that we might have:
D - G, D' - G, D - G', D' - G'
where D == D' and G == G'. So the list of associations may list all four combinations, even though we know that they are equivalent. this becomes a big problem when there are 10 gene-equivalents and 5 disease equivalents...it explodes.
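To make the blow-up concrete, here is a minimal sketch that counts the rows for the 10 x 5 case mentioned above. The identifiers are placeholders, not real CURIEs.

```python
from itertools import product

# Placeholder identifiers: 5 equivalent disease ids and 10 equivalent gene ids.
disease_ids = [f"D{i}" for i in range(5)]
gene_ids = [f"G{j}" for j in range(10)]

# Each (disease id, gene id) pairing is reported as its own association row.
associations = list(product(disease_ids, gene_ids))

print(len(associations))  # 50 rows for what is really a single association
```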
we can approach this by processing the graph to:
for example, for genes, we may choose to use NCBIGene > ENSEMBL > others. or maybe we would want to make the MOD gene id representative (MGI, HGNC, ZFIN, FB, etc), followed by the others. similarly, variants might follow the MOD variant id > dbSNP > other.
then, the one that is the representative of the group would get a special category attached to the node, and this could be used in cypher queries, and be leveraged when building the golr for the associations. then all equivalent ids that are used in the cypher path would go in the evidence graph.
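A rough sketch of the representative idea, in Python rather than cypher, just to show the logic. Assumptions: equivalence groups are already known, the gene priority is the NCBIGene > ENSEMBL > others ordering from above, the disease priority is invented for illustration, and every identifier and function name here is a placeholder rather than anything in the actual pipeline.

```python
from collections import defaultdict

GENE_PRIORITY = ["NCBIGene", "ENSEMBL"]  # ordering suggested in this thread
DISEASE_PRIORITY = ["OMIM"]              # assumption, for illustration only


def pick_representative(clique, priority):
    """Return the member with the best-ranked prefix; ties break alphabetically."""
    def rank(curie):
        pfx = curie.split(":", 1)[0]
        pos = priority.index(pfx) if pfx in priority else len(priority)
        return (pos, curie)
    return min(clique, key=rank)


def representative_map(cliques, priority):
    """Map every identifier to the representative of its equivalence group."""
    return {m: pick_representative(c, priority) for c in cliques for m in c}


def collapse(associations, disease_rep, gene_rep):
    """Group (disease, gene) rows by representative pair.

    The original pairs are kept per group, standing in for the evidence graph.
    """
    grouped = defaultdict(set)
    for d, g in associations:
        grouped[(disease_rep.get(d, d), gene_rep.get(g, g))].add((d, g))
    return dict(grouped)


# Placeholder equivalence groups (D == D', G == G').
disease_cliques = [{"OMIM:D1", "EX:D1"}]
gene_cliques = [{"NCBIGene:G1", "ENSEMBL:G1"}]
associations = [(d, g) for d in disease_cliques[0] for g in gene_cliques[0]]

collapsed = collapse(
    associations,
    representative_map(disease_cliques, DISEASE_PRIORITY),
    representative_map(gene_cliques, GENE_PRIORITY),
)
print(list(collapsed))  # one representative pair instead of four rows
```

The four original pairs survive as the value for the single key, which is roughly what "all equivalent ids go in the evidence graph" would mean here.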
As another option (and perhaps a temporary solution) we can handle this with Solr's grouping feature: https://wiki.apache.org/solr/FieldCollapsing And as an actual example: http://geoffrey.crbs.ucsd.edu:8080/solr/golr/select/?q=subject:NCBIGene\:1981%20AND%20object_category:phenotype&group=true&group.field=object&wt=json
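For reference, a small sketch that builds the same grouped query programmatically. It only constructs the URL and does not contact the server. The parameters (`group`, `group.field`, `wt`) and the endpoint are the ones from the example link; the helper name `grouped_query_url` is made up.

```python
from urllib.parse import urlencode

# Endpoint taken from the example link above.
SOLR_SELECT = "http://geoffrey.crbs.ucsd.edu:8080/solr/golr/select/"


def grouped_query_url(subject, object_category, group_field="object"):
    """Build a Solr select URL that collapses results on one field."""
    # Colons inside a CURIE must be escaped in the Solr query syntax.
    escaped = subject.replace(":", r"\:")
    params = {
        "q": f"subject:{escaped} AND object_category:{object_category}",
        "group": "true",             # turn on result grouping
        "group.field": group_field,  # one group per distinct value
        "wt": "json",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = grouped_query_url("NCBIGene:1981", "phenotype")
print(url)
```

Grouping on `object` returns one group per phenotype, so duplicate rows fold together at query time without touching the graph. It does not merge the evidence graphs, which is why it reads more like a stopgap.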
@kshefchek - is it possible to use this solution for this release?
Currently there are quite a few duplicate gene phenotype associations, see here: http://beta.monarchinitiative.org/labs/widget-scratch/disease/OMIM:274600
Without investigating, it's likely each result has a different evidence graph, for example a gene-disease association through many variants of a single gene. How should we approach this on the front end? Do we want to have a variant field in the golr results? Or alternatively should this be uniquified for now?