Closed jmcmurry closed 6 years ago
I love this. Please add species to genes and diseases.
Longer term, you could collapse up to a taxon, or list the number of species that we have data for, e.g. "Pax2 (four species)" or "Pax2 (vertebrata)".
@mellybelly There are 8 "Pax" results in the gene category, each of which has multiple corresponding species; it gets especially complex if you consider letter-case variations in the gene name. Blue sky -- how would you like to see taxon represented in cases like this? Is the ideal for EACH result to show the corresponding nearest taxon parent? Keep in mind that the more logic we rely on in autocomplete, the slower it gets. Just now, I get a 504 for http://monarchinitiative.org/gene/NCBIGene%3A18503 when trying to select the Pax-1 gene from here:
After several discussions with the folks working on Golr, it was suggested that I use Golr as the source of the autocomplete data. I have been asked to outline my requirements and assumptions regarding this functionality. I would like a method that takes a string representing part of a label (ex: 'parkin', 'alzhe') and returns a list of matching concepts. The call should be: getAutocomplete(string searchString, array categories, int limit), where searchString is the partial string to match (ex: 'parkin', 'alzhe'), categories is a list of the types of data to return (ex: gene, disease, etc.; a null would mean do not filter by category), and limit caps the number of items returned (ex: 10, 15, 20). Results should be sorted by score and only the top N items returned. The searchString should match the prefLabel of the item (unless the ontology folks think otherwise).
The data returned should be an array of tuples, each containing:
identifier: an identifier that can be used to find the item within the monarch context (ex: NCBIGene:135138, DOID:11193, etc.)
label: the prefLabel for the item. This should contain the searchString.
category: a string representing the category of the item (gene, disease, phenotype, etc.)
taxonid: the numeric taxon id from NCBI (ex: 9606, 10090, etc.)
taxonLabel: the human-readable version of the taxonid (Homo sapiens, Mus musculus, etc.)
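To make the requested contract concrete, here is a minimal Python sketch of the call and the returned records. It is only an illustration under stated assumptions: the in-memory `EXAMPLE_INDEX` stands in for Golr, the `EX:` identifiers and the scores are invented, and substring matching on the label is just one possible reading of "partial string match".

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class AutocompleteResult:
    identifier: str              # ID usable within the monarch context
    label: str                   # prefLabel; should contain the search string
    category: str                # e.g. "gene", "disease", "phenotype"
    taxon_id: Optional[int]      # NCBI taxon id, None when not applicable
    taxon_label: Optional[str]   # human-readable taxon name
    score: float                 # relevance score (invented for this sketch)


# Hypothetical stand-in for the Golr index; identifiers and scores are made up.
EXAMPLE_INDEX = [
    AutocompleteResult("EX:1", "Pax6", "gene", 9606, "Homo sapiens", 0.9),
    AutocompleteResult("EX:2", "Pax6", "gene", 10090, "Mus musculus", 0.8),
    AutocompleteResult("EX:3", "Pax2", "gene", 9606, "Homo sapiens", 0.7),
    AutocompleteResult("EX:4", "example pax-related disease", "disease", None, None, 0.6),
]


def get_autocomplete(
    search_string: str,
    categories: Optional[Sequence[str]] = None,
    limit: int = 10,
    index: Sequence[AutocompleteResult] = EXAMPLE_INDEX,
) -> List[AutocompleteResult]:
    """Return the top `limit` label matches, highest score first."""
    needle = search_string.lower()
    hits = [
        r for r in index
        if needle in r.label.lower()
        # categories=None means "do not filter by category"
        and (categories is None or r.category in categories)
    ]
    hits.sort(key=lambda r: r.score, reverse=True)
    return hits[:limit]
```

For example, `get_autocomplete("pax", ["gene"], 2)` returns the two highest-scoring gene entries from the sample index, and passing `None` for the categories returns matches from every category.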
Q on requirements:
Maybe collapse text-equivalent phenotypes and diseases into a single result group called "diseases and phenotypes"
How would this work? We have to send users to either a disease page OR a phenotype page. Unless we also merge the distinction between these pages more, or make it easy to traverse between them. This is hard, so I would say remove this requirement for the time being
An autocomplete result in a combined P/D category wouldn't necessarily mean that each option has both a corresponding Phenotype and Disease record. This is more the exception (like cleft palate) than the rule.
Thus agreed, it is fine to punt this requirement and reconsider it later.
For that later conversation ... collapsing in autocomplete doesn't necessarily mean you have to collapse in the downstream. If an option was selected from the p/d category, it could lead to a results page where you could choose either phenotype or disease.
For species qualifiers on genes, we have a few different strategies. These are not mutually exclusive.
The simplest is to fold the species into the label ahead of time. This is the strategy we take for neo (https://github.com/geneontology/neo/). It is equivalent to the strategy taken for PRO. This has the advantage of requiring no special logic on the client side, which would presumably be faster as well as simpler. There is no need for any taxon ID in the results, it's a completely generic autocomplete. The disadvantage is that it is harder to make the species show in a different color, as we do now.
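As a rough illustration of the label-folding strategy, the sketch below builds a species-qualified label at load time. The `"symbol (species)"` format and the function name are assumptions for the example, not the format neo or PRO actually use.

```python
def fold_species_into_label(symbol: str, species: str) -> str:
    """Build a label that is unique per species (format is an assumption)."""
    return f"{symbol} ({species})"


# Two genes that share a symbol become distinguishable by label alone,
# so the client needs no taxon field and no special logic.
genes = [("Pax6", "Homo sapiens"), ("Pax6", "Mus musculus")]
labels = [fold_species_into_label(symbol, species) for symbol, species in genes]
```

Because the species is baked into the string, the client cannot easily style it separately, which is the disadvantage noted above.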
Alternatively, we need to ensure that taxon ID is populated in the golr index. This will require some plumbing. Our current strategy for loading the index is to use the generic golr ontology loader https://github.com/owlcollab/owltools/wiki/Loading-GAFs-Into-Solr which has no special knowledge of taxon IDs. It is possible to configure the yaml to do an additional call that will fetch this, but this requires some knowledge of the internals (cc @hdietze ). At this stage it may be simpler for @jnguyenx to extend the SciGraph golr loader to load entity nodes, rather than reuse the owltools one.
Note that we can still do both: uniquify labels AND include taxon metadata in the AC payload.
The strategy may be informed by the roll-up strategy, below:
I assume the general strategy for solr would be as follows:
The client calls the AC API (i.e. solr) and gets back a flat list where each entry contains sufficient metadata for the client to collapse. E.g. each entry could contain an orthogroup field:
{label=pax6, taxon=9606, orthogroup=123, ...}
{label=pax6, taxon=10090, orthogroup=123, ...}
{label=pax7, taxon=9606, orthogroup=9876, ...}
The client would then collapse into the orthogroup. The same general strategy would work for a 2-level collapse (e.g. gene orthogroup and family). Also, instead of orthogroup, we could have a representative member field that would by default point to the human gene.
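The client-side collapse could be sketched roughly as follows in Python. The entries mirror the three example records above; the output field names (`representative_taxon`, `taxa`) and the rule "prefer the human member, else the first one seen" are assumptions made for illustration.

```python
HUMAN_TAXON = 9606  # NCBI taxon id for Homo sapiens


def collapse_by_orthogroup(entries):
    """Group a flat AC result list by orthogroup, one output row per group."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry["orthogroup"], []).append(entry)

    collapsed = []
    for orthogroup, members in groups.items():
        # Prefer the human gene as representative; fall back to the first member.
        rep = next((m for m in members if m["taxon"] == HUMAN_TAXON), members[0])
        collapsed.append({
            "orthogroup": orthogroup,
            "label": rep["label"],
            "representative_taxon": rep["taxon"],
            "taxa": sorted({m["taxon"] for m in members}),
        })
    return collapsed


flat = [
    {"label": "pax6", "taxon": 9606, "orthogroup": 123},
    {"label": "pax6", "taxon": 10090, "orthogroup": 123},
    {"label": "pax7", "taxon": 9606, "orthogroup": 9876},
]
collapsed = collapse_by_orthogroup(flat)
```

Here the two pax6 entries fold into a single row listing both taxa, while pax7 stays as its own row. A second level (e.g. family) could be handled by applying the same grouping again on a family field.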
Perhaps I'm thinking about solr naively and there is a better way to do this, cc @kltm
I'm also not clear on certain aspects. For example, if we have a cutoff of N entries, and there are >N genes in the group, how does this affect things?
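One way to see the concern is to compare applying the cutoff before and after collapsing. The Python sketch below is only an illustration with made-up entries (the taxon ids are the NCBI ids for human, mouse, and zebrafish): when the flat list is truncated first, a single large group can use up all N slots.

```python
def distinct_groups(entries):
    """Return orthogroup ids in first-seen order."""
    seen = []
    for entry in entries:
        if entry["orthogroup"] not in seen:
            seen.append(entry["orthogroup"])
    return seen


def truncate_then_collapse(entries, limit):
    # Cutoff applied to the flat list (e.g. on the server), collapse afterwards.
    return distinct_groups(entries[:limit])


def collapse_then_truncate(entries, limit):
    # Collapse first, then cap the number of groups shown.
    return distinct_groups(entries)[:limit]


entries = [
    {"label": "pax6", "taxon": 9606, "orthogroup": 123},
    {"label": "pax6", "taxon": 10090, "orthogroup": 123},
    {"label": "pax6", "taxon": 7955, "orthogroup": 123},
    {"label": "pax7", "taxon": 9606, "orthogroup": 9876},
]
```

With a limit of 2, truncating first yields only orthogroup 123, whereas collapsing first yields both groups. This suggests the cutoff would need to be applied per group, or the grouping done on the server, for the collapse to behave predictably.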
Given the highly specific requirements, there is an argument for writing a special-purpose taxon-aware autocomplete API within SciGraph. This would probably need a lot of tuning to be as responsive as solr.
Either way, more work needs to be done on the server side to give the client the information it needs to provide informative AC.
We also want to steer people away from dead-end pages. This could be done by providing additional metadata in the AC entry, as simple as has-data=true, but possibly something more sophisticated.
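A simple client-side use of such a flag might look like the Python sketch below. The `has_data` key mirrors the has-data=true idea above, but the field name and the choice to demote rather than hide entries are assumptions for illustration.

```python
def prefer_entries_with_data(entries):
    """Stable sort that lists entries with data first, keeping score order otherwise."""
    # `not has_data` is False for entries with data, and False sorts before True.
    return sorted(entries, key=lambda e: not e.get("has_data", False))


# Hypothetical AC entries, already ordered by score.
ac_entries = [
    {"label": "pax1", "has_data": False},
    {"label": "pax2", "has_data": True},
    {"label": "pax3"},  # flag missing: treated as no data
    {"label": "pax6", "has_data": True},
]
ordered = prefer_entries_with_data(ac_entries)
```

Demoting keeps every match reachable while still nudging users toward pages that have content; filtering the entries out entirely would be the stricter alternative.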
What happens when the user hits return on AC? Do the search results look the same as the AC?
This is essentially what we do now:
https://monarchinitiative.org/search/pax
Do we include more?
There's an MGI mouse (http://www.informatics.jax.org/allele/MGI:4939486) that has a mutation in Ptpn6 (https://monarchinitiative.org/gene/MGI:96055). This mouse is directly reachable in Monarch if we use the ID: https://monarchinitiative.org/model/MGI:4939486. However, if we search via the gene by its symbol, we cannot get there because:
https://monarchinitiative.org/search/Ptpn6
This is evidence that, in this brave new era of dozens of species, we are now totally overloading the autocomplete box. It also demonstrates why my personal feeling is that we should not include species-specific autocomplete matches in there at all (those should be part of the search results instead). Ideally, sure, we would want higher taxon-level groupings in autocomplete; my concern there is that autocomplete is already so sluggish.
Regardless, I hope we can all agree that a) the search results should not be artificially bounded at n matches and b) in the results view, taxon should not be blank.
PS: the interdigitation of "Ptpn6 gene" with "Ptpn6 [species]" is confusing, but since the search results page is missing the taxon, I'm guessing this is just reflected in the autocomplete results (so it defaults to the class)? @kshefchek
I have to defer to @cborromeo on any autocomplete/search developments.
In agreement with regard to an uber gene, see: https://github.com/monarch-initiative/monarch-app/issues/1210#issue-142510343
@kshefchek Thanks for clarifying. re: uber gene families, it is a little confusing. Going to the result labeled 'gene' gets me to a page for which it is not obvious that it refers to a gene family. Perhaps we could make this clearer? Broke out a ticket to this effect. https://github.com/monarch-initiative/monarch-app/issues/1244
I'm in agreement we should implement this, but this hasn't been implemented. Right now if taxon information is not available in scigraph, we default back to the category, so in this case, we're missing the mouse taxon information for this gene.
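The fallback described here could be sketched in Python as below. The entry shape and function names are hypothetical; the point is only that a missing taxon label yields the category as the qualifier, which would explain "Ptpn6 gene" appearing next to species-qualified entries.

```python
def display_qualifier(entry):
    """Use the taxon label when present, otherwise fall back to the category."""
    return entry.get("taxon_label") or entry["category"]


def display_label(entry):
    return f'{entry["label"]} {display_qualifier(entry)}'


# Hypothetical entries: the first has no taxon information available.
missing_taxon = {"label": "Ptpn6", "category": "gene", "taxon_label": None}
with_taxon = {"label": "Ptpn6", "category": "gene", "taxon_label": "Mus musculus"}
shown = [display_label(missing_taxon), display_label(with_taxon)]
```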
Ah, I see now that I misunderstood you; gene family pages don't play a role in the weirdness we're currently seeing for Ptpn6. The equivalent (MOD-specific) genes are the ones with the right taxonomy; however, we're not inferring back to the NCBI gene equivalents (which I guess are the clique leaders and therefore the ones we are pinging for autocomplete results)?
The cliqueLeader nodes will have all edges from equivalent nodes moved to them, so I imagine this is missing in the data. I can try to track it down.
The need to break down results in categories is underscored by results like this:
The result "eyeless Fruit fly", in the absence of categorical clues, could be a gene, a model, or a species-specific phenotype.
I think we can close this for now. There are other improvements to make but should just be new tickets at this point.
This is a high level ticket to capture the discussions we have been having on the site search strategy, autocomplete in particular. Below are some of the possible enhancements. Not all are feasible. Some may not be desirable. Not everyone will agree 100% and that is good.
"PAX interacting (with transcription-activation domain) protein 1" is among the autocomplete results for pax but not in the search results page for pax.
Overall, here's my distilled version of an autocomplete proposal: