Draft of possible enhancements to site autocomplete

jmcmurry commented 9 years ago

This is a high-level ticket to capture the discussions we have been having on the site search strategy, autocomplete in particular. Below are some of the possible enhancements. Not all are feasible. Some may not be desirable. Not everyone will agree 100%, and that is good.

[ ] Offer "exploratory search" options ("in all results") as well as "known item" search options
[ ] Collapse all text-matched genes, regardless of case / species, into a single item, leaving the finesse of ranking results (human first etc) to site search results page. I could provide a detailed rationale, but one big part of it is that we are up at over 15 species (yay!) and just listing them is crowding the space now (boo!).
[ ] In addition to specific genes, roll up related genes into families where possible (e.g. "pax gene family", or "blood coagulation factors")
[ ] Add short descriptive text to genes where available / appropriate (e.g. pax2 = paired box gene)
[ ] Maybe collapse text-equivalent phenotypes and diseases into a single result group called "diseases and phenotypes"? Not sure if there are enough use cases to warrant this, but it would be nice for things like cleft palate shown below.
[ ] Improve performance. If it takes several seconds and a spinny widget to populate autocomplete list, it is hard to gain user trust.
[ ] Make the results container appropriately wide, or truncate results, so that we don't get word wrap on a standard display (see below). On mobile devices this is unavoidable.
[ ] Make sure that the search results landing page obviously contains at least the results from autocomplete list. The current behavior is weird, perhaps to do with which synonyms are pulled? PAX interacting (with transcription-activation domain) protein 1 is among autocomplete results for pax but not in search results page for pax.
[ ] Refine the algorithm so that we get matches on second words, words with apostrophes, identifiers, etc. Also allow narrative to be pasted in and handed over to the annotation widget. There are tickets open for this. I will link these later.
[ ] Show Search History when user clicks on entry box (before search term is entered)
[ ] Render 'previously visited' terms in purple

Overall, here's my distilled version of an autocomplete proposal:

Consider this	Rather than this

mellybelly commented 9 years ago

I love this. please add species to genes and diseases.

mellybelly commented 9 years ago

longer term, you could collapse up to a taxon, or list # of species that we have data for (e.g. Pax2 (four species)) or (Pax2 (vertebrata))

jmcmurry commented 8 years ago

@mellybelly There are 8 "Pax" results in the gene category -- each of which has multiple corresponding species; it gets especially complex if you consider case typing variations in the gene name. Blue sky -- How would you like to see taxon represented in cases like this? Is the ideal for EACH result to show the corresponding nearest taxon parent? Keep in mind that the more logic we rely on in autocomplete, the slower it gets. Just now, I get a 504 for http://monarchinitiative.org/gene/NCBIGene%3A18503 when trying to select Pax-1 gene from here:

frdougal commented 8 years ago

After several discussions with the folks working on Golr, it was suggested that I use Golr as the source of the autocomplete data. I have been asked to outline my requirements and assumptions regarding this functionality. I would like a method that takes a string representing part of a label (ex: 'parkin', 'alzhe') and returns a list of matching concepts. The call signature should be: getAutocomplete(string searchString, array categories, int limit), where searchString is the partial string to match (ex: 'parkin', 'alzhe'); categories is a list of the types of data to return (ex: gene, disease, etc.), with null meaning do not filter by category; and limit caps the number of items returned (ex: 10, 15, 20). Results should be sorted by score and only the top N items, where N is the limit, returned. The searchString should match the prefLabel of the item (unless the ontology folks think otherwise).

The data returned should be an array of tuples, each with these fields:

identifier: an identifier that can be used to find the item within the monarch context (ex: NCBIGene:135138, DOID:11193, etc.)
label: the prefLabel for the item. This should contain the searchString.
category: a string representing the category of the item (gene, disease, phenotype, etc.)
taxonid: the numeric taxon id from NCBI (ex: 9606, 10090, etc.)
taxonLabel: the human-readable version of the taxonid (Homo sapiens, Mus musculus, etc.)
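To make the request concrete, here is a minimal Python sketch of the call and the returned tuples. It is only an illustration under stated assumptions: the in-memory INDEX stands in for Golr, the identifiers, labels and scores in it are invented, the score field is hypothetical, and matching is a plain case-insensitive substring test on the label.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Concept:
    identifier: str                     # e.g. "NCBIGene:135138"
    label: str                          # prefLabel of the item
    category: str                       # "gene", "disease", "phenotype", ...
    taxon_id: Optional[int] = None      # NCBI taxon id, e.g. 9606
    taxon_label: Optional[str] = None   # e.g. "Homo sapiens"
    score: float = 0.0                  # hypothetical relevance score


# Toy index: identifiers, labels and scores are invented for illustration.
INDEX = [
    Concept("EX:1", "Parkinson disease", "disease", score=0.9),
    Concept("EX:2", "parkin toy gene", "gene", 9606, "Homo sapiens", 0.8),
    Concept("EX:3", "parkin toy gene", "gene", 10090, "Mus musculus", 0.7),
    Concept("EX:4", "Alzheimer disease", "disease", score=0.6),
]


def get_autocomplete(search_string: str,
                     categories: Optional[Sequence[str]] = None,
                     limit: int = 10) -> List[Concept]:
    """Return the top `limit` concepts whose label contains search_string."""
    needle = search_string.casefold()
    hits = [
        c for c in INDEX
        if needle in c.label.casefold()
        # categories=None means "do not filter by category"
        and (categories is None or c.category in categories)
    ]
    # Sort by score first, then cut, so the cap keeps the best matches.
    hits.sort(key=lambda c: c.score, reverse=True)
    return hits[:limit]
```

For example, get_autocomplete("parkin", ["gene"], 10) would return only the two toy gene entries, highest score first.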

cmungall commented 8 years ago

Q on requirements:

Maybe collapse text-equivalent phenotypes and diseases into a single result group called "diseases and phenotypes"

How would this work? We have to send users to either a disease page OR a phenotype page, unless we also merge the distinction between these pages more or make it easy to traverse between them. That is hard, so I would say remove this requirement for the time being.

jmcmurry commented 8 years ago

An autocomplete result in a combined P/D category wouldn't necessarily mean that each option has both a corresponding Phenotype and Disease record. This is more the exception (like cleft palate) than the rule.

Thus agreed, it is fine to punt this requirement and reconsider it later.

For that later conversation ... collapsing in autocomplete doesn't necessarily mean you have to collapse in the downstream. If an option was selected from the p/d category, it could lead to a results page where you could choose either phenotype or disease.

cmungall commented 8 years ago

For species qualifiers on genes, we have a few different strategies. These are not mutually exclusive.

including taxon to disambiguate duplicate symbols across species

The simplest is to fold the species into the label ahead of time. This is the strategy we take for neo (https://github.com/geneontology/neo/). It is equivalent to the strategy taken for PRO. This has the advantage of requiring no special logic on the client side, which would presumably be faster as well as simpler. There is no need for any taxon ID in the results, it's a completely generic autocomplete. The disadvantage is that it is harder to make the species show in a different color, as we do now.
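As a rough illustration of the label-folding strategy, the Python sketch below builds the display label at load time. The function name and the parenthesised suffix format are assumptions for illustration only; the thread does not say how neo or PRO actually format their labels.

```python
from typing import Optional


def fold_species_into_label(symbol: str, taxon_label: Optional[str]) -> str:
    """Build an index-time label such as 'pax6 (Homo sapiens)'.

    Entries without taxon information keep the bare symbol.
    """
    return f"{symbol} ({taxon_label})" if taxon_label else symbol


# Toy rows of (symbol, taxon label); two species share the symbol 'pax6'.
rows = [("pax6", "Homo sapiens"), ("pax6", "Mus musculus"), ("pax7", None)]
labels = [fold_species_into_label(symbol, taxon) for symbol, taxon in rows]
```

Because the species is already part of the label, the client can treat the result as a generic list of strings, which is the advantage described above.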

Alternatively, we need to ensure that taxon ID is populated in the golr index. This will require some plumbing. Our current strategy for loading the index is to use the generic golr ontology loader https://github.com/owlcollab/owltools/wiki/Loading-GAFs-Into-Solr which has no special knowledge of taxon IDs. It is possible to configure the yaml to do an additional call that will fetch this, but this requires some knowledge of the internals (cc @hdietze ). At this stage it may be simpler for @jnguyenx to extend the SciGraph golr loader to load entity nodes, rather than reuse the owltools one.

Note we can still do both: uniquify labels AND include taxon metadata in AC payload

The strategy may be informed by the roll-up strategy, below:

rolling up / collapsing genes

I assume the general strategy for solr would be as follows:

The client calls the AC API (i.e. solr) and gets back a flat list where each entry contains sufficient metadata for the client to collapse. E.g. each entry could contain an orthogroup field:

{label=pax6, taxon=9606, orthogroup=123, ...
{label=pax6, taxon=10090, orthogroup=123, ...
...
{label=pax7, taxon=9606, orthogroup=9876, ...
...

The client would then collapse the entries into their orthogroup. The same general strategy would work for a 2-level collapse (e.g. gene orthogroup and family). Also, instead of orthogroup, we could have a representative-member field that would by default point to the human gene.
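A minimal Python sketch of the client-side collapse, using the three example entries above. The dictionary keys mirror the fields in the example; the function name and the shape of the collapsed result are assumptions, and the representative member falls back to the first entry when a group has no human gene.

```python
HUMAN_TAXON = 9606


def collapse_by_orthogroup(entries):
    """Group flat AC entries by orthogroup, keeping first-seen group order."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry["orthogroup"], []).append(entry)

    collapsed = []
    for orthogroup, members in groups.items():
        # Prefer the human gene as representative; otherwise take the first.
        representative = next(
            (m for m in members if m["taxon"] == HUMAN_TAXON), members[0]
        )
        collapsed.append({
            "orthogroup": orthogroup,
            "label": representative["label"],
            "taxa": [m["taxon"] for m in members],
        })
    return collapsed


flat = [
    {"label": "pax6", "taxon": 9606, "orthogroup": 123},
    {"label": "pax6", "taxon": 10090, "orthogroup": 123},
    {"label": "pax7", "taxon": 9606, "orthogroup": 9876},
]
collapsed = collapse_by_orthogroup(flat)
```

A 2-level collapse could apply the same grouping a second time on a family field, if the payload carried one.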

Perhaps I'm thinking about solr naively and there is a better way to do this, cc @kltm

I'm also not clear on certain aspects. For example, if we have a cutoff of N entries, and there are >N genes in the group, how does this affect things?
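One way to see the concern is the small Python sketch below, which compares cutting the flat list at N before collapsing with collapsing first and cutting afterwards. It reuses the example entries from earlier in this comment; the helper name is hypothetical and this is not a statement about how solr behaves.

```python
def distinct_groups(entries):
    """Return orthogroup ids in first-seen order."""
    groups = []
    for entry in entries:
        if entry["orthogroup"] not in groups:
            groups.append(entry["orthogroup"])
    return groups


flat = [
    {"label": "pax6", "taxon": 9606, "orthogroup": 123},
    {"label": "pax6", "taxon": 10090, "orthogroup": 123},
    {"label": "pax7", "taxon": 9606, "orthogroup": 9876},
]
N = 2

# Cutting the flat list first spends both slots on the same group.
cut_then_collapse = distinct_groups(flat[:N])   # [123]

# Collapsing first lets the cutoff count groups instead of raw entries.
collapse_then_cut = distinct_groups(flat)[:N]   # [123, 9876]
```

If the cutoff is applied server-side to raw entries, a group with more than N members can crowd every other group out of the list, so the cutoff would probably need to count groups or be applied after collapsing.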

Given the highly specific requirements, there is an argument for writing a special-purpose taxon-aware autocomplete API within SciGraph. This would probably need a lot of tuning to be as responsive as solr.

Either way, more work needs to be done on the server side to give the client the information it needs to provide informative AC.

Additional requirements: steer away from deadend pages

We also want to steer people away from deadend pages. This could be done by providing additional metadata in the AC entry, as simple as has-data=true, but possibly something more sophisticated
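A minimal Python sketch of how a client might use such a flag, assuming each AC entry carries a boolean has_data key (a hypothetical field name). Entries without data are either pushed to the bottom or dropped entirely.

```python
def prefer_entries_with_data(entries, drop_deadends=False):
    """Demote (or drop) entries whose target page would have no data.

    Entries missing the flag are treated as having no data.
    """
    if drop_deadends:
        return [e for e in entries if e.get("has_data", False)]
    # sorted() is stable, so the original ranking is kept within each half;
    # the key is False for entries with data, which sorts them first.
    return sorted(entries, key=lambda e: not e.get("has_data", False))


entries = [
    {"label": "toy gene A", "has_data": False},
    {"label": "toy gene B", "has_data": True},
    {"label": "toy gene C"},
    {"label": "toy gene D", "has_data": True},
]
demoted = [e["label"] for e in prefer_entries_with_data(entries)]
kept = [e["label"] for e in prefer_entries_with_data(entries, drop_deadends=True)]
```

Demoting rather than dropping keeps known items findable while still steering most clicks toward pages with content.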

Relationship to search

What happens when the user hits return on AC? Do the search results look the same as the AC?

This is essentially what we do now:

https://monarchinitiative.org/search/pax

Do we include more?

jmcmurry commented 8 years ago

There's an MGI mouse ( http://www.informatics.jax.org/allele/MGI:4939486 ) that has a mutation in Ptpn6 ( https://monarchinitiative.org/gene/MGI:96055 ). This mouse is directly reachable in Monarch if we use the ID: https://monarchinitiative.org/model/MGI:4939486 However, if we search via the gene by its symbol, we cannot get there because:

1. The Mus musculus entry is not on the autocomplete shortlist.

2. Nor is it among the search results.

https://monarchinitiative.org/search/Ptpn6

This is evidence that, in this brave new era of dozens of species, we are now totally overloading the autocomplete box. It demonstrates why my personal feeling is that we should not be including species-specific autocomplete matches in there at all; that should be part of the search results instead. Ideally, sure, we would want to have higher taxon-level groupings in autocomplete; however, my concern there is that autocomplete is already so sluggish.

Regardless, I hope we can all agree that a) the search results should not be artificially bounded at n matches and b) in the results view, taxon should not be blank.

jmcmurry commented 8 years ago

ps, the interdigitation of "Ptpn6 gene" with "Ptpn6 [species]" is confusing, but based on the search results page missing the taxon, I'm guessing this is just reflected in the autocomplete results (so they default to the class)? @kshefchek

kshefchek commented 8 years ago

I have to defer to @cborromeo on any autocomplete/search developments.

In agreement with regards to an uber gene, see: https://github.com/monarch-initiative/monarch-app/issues/1210#issue-142510343

jmcmurry commented 8 years ago

@kshefchek Thanks for clarifying. re: uber gene families, it is a little confusing. Going to the result labeled 'gene' gets me to a page for which it is not obvious that it refers to a gene family. Perhaps we could make this clearer? Broke out a ticket to this effect. https://github.com/monarch-initiative/monarch-app/issues/1244

kshefchek commented 8 years ago

I'm in agreement we should implement this, but this hasn't been implemented. Right now if taxon information is not available in scigraph, we default back to the category, so in this case, we're missing the mouse taxon information for this gene.
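The fallback described here can be sketched in a few lines of Python. The entry keys (taxon_label, category) and the function name are assumptions for illustration, not the actual monarch-app code.

```python
def display_qualifier(entry):
    """Return the text shown next to a label in autocomplete.

    Uses the taxon label when present and non-empty; otherwise falls
    back to the category, which is why a gene with missing taxon
    information shows up as plain 'gene'.
    """
    return entry.get("taxon_label") or entry["category"]


with_taxon = {"label": "Ptpn6", "category": "gene", "taxon_label": "Mus musculus"}
without_taxon = {"label": "Ptpn6", "category": "gene"}
```

Under this logic the two entries above would render as "Ptpn6 Mus musculus" and "Ptpn6 gene", which matches the interleaving reported earlier in the thread.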

jmcmurry commented 8 years ago

Ah, I see now that I misunderstood you; gene family pages don't play a role in the weirdness we're currently seeing for Ptpn6. The equivalent (MOD-specific) genes are the ones with the right taxonomy; however, we're not inferring it back to the NCBI gene equivalents (which I guess are the clique leaders and therefore the ones we are pinging for autocomplete results)?

kshefchek commented 8 years ago

The cliqueLeader nodes will have all edges from equivalent nodes moved to them, so I imagine this is missing in the data. I can try to track it down.

jmcmurry commented 8 years ago

The need to break down results in categories is underscored by results like this:

The result "eyeless Fruit fly", in the absence of categorical clues, could be a gene, a model, or a species-specific phenotype.

jmcmurry commented 7 years ago

I think we can close this for now. There are other improvements to make, but they should just be new tickets at this point.

kshefchek commented 6 years ago

See ^ https://github.com/monarch-initiative/monarch-app/issues/1008#issuecomment-317058990
