monarch-initiative / monarch-legacy

Monarch web application and API
BSD 3-Clause "New" or "Revised" License
Reroute failed searches to the closest matching term & include warning / escape hatch #1320

Open jmcmurry opened 8 years ago

jmcmurry commented 8 years ago

A big takeaway from the recent UX efforts has been that people often don't wait for autocomplete. Moreover, autocomplete is not well suited to hunt-and-peck typists. Thus, in addition to improving performance, we also need to improve site search results.

Lots of diseases, phenotypes, and genes have peculiar or difficult spellings/punctuation. When there are no exact matches at the level of the word, we currently show "No results", even when loads of content is a mere character away.

At the back end, we thus need fuzzy matching (not just stemming and wildcard). When things are lexically very similar, we should follow the giants and default to the closest matching term with a warning.
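To make the "closest matching term with a warning" idea concrete, here is a minimal sketch using Python's standard-library `difflib`. It is not how Monarch search works; the function name `closest_term`, the `cutoff` value, and the wording of the warning are all assumptions made for illustration.

```python
import difflib

def closest_term(query, vocabulary, cutoff=0.8):
    """Return (term, warning) for the best match of `query` in `vocabulary`.

    `cutoff` is an assumed similarity threshold (0..1), not a tuned value.
    Returns (None, None) when nothing is lexically close enough.
    """
    # Map lowercased terms back to their canonical display form.
    lowered = {term.lower(): term for term in vocabulary}
    q = query.strip().lower()
    if q in lowered:
        # Exact match: no rerouting, so no warning.
        return lowered[q], None
    matches = difflib.get_close_matches(q, list(lowered), n=1, cutoff=cutoff)
    if not matches:
        return None, None
    best = lowered[matches[0]]
    # The warning doubles as the escape hatch back to the literal query.
    warning = f'Showing results for "{best}". Search instead for "{query}".'
    return best, warning
```

A misspelling such as "Loeys-Deitz syndrome" would be rerouted to "Loeys-Dietz syndrome" with the warning attached, while an exact hit returns no warning at all.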

[Screenshot: 2016-07-28 at 9:46:40 am]

harryhoch commented 8 years ago

+1. Solr should help here, as it supports searches within some edit distance
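For reference, Solr's standard query parser expresses this as a `~N` suffix on a term, where N is the maximum edit distance (at most 2). The sketch below only builds such a query string; the helper name `fuzzy_query` is hypothetical, and how the target field is analyzed (for example, whether hyphens are split at index time) is an assumption that would change what actually matches.

```python
import re

# Characters with special meaning in the Lucene/Solr standard query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')

def fuzzy_query(text, max_edits=1):
    """Build a query string where every whitespace-separated token is fuzzy.

    Hypothetical helper: it escapes special characters in each token and
    appends the `~N` fuzzy operator. Solr allows N of 0, 1 or 2.
    """
    if max_edits not in (0, 1, 2):
        raise ValueError("Solr fuzzy search supports an edit distance of 0-2")
    terms = []
    for token in text.split():
        escaped = _LUCENE_SPECIAL.sub(r"\\\1", token)
        terms.append(f"{escaped}~{max_edits}")
    return " ".join(terms)
```

With this, `fuzzy_query("Deitz")` yields `Deitz~1`, which would match "Dietz" only if the two spellings fall within the edit distance Solr computes for that field.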

jmcmurry commented 8 years ago

Actually, this scenario is even worse than I originally realized. a) There are no site-search hits for any of the hyphenated forms of the disease (e.g., related to #1282)

[Screenshot: 2016-07-28 at 11:12:27 am]

b) Including the hyphen only, and not Dietz, gets you no results, but does get you a "did you mean" page, which I didn't even know we had :). Although the "did you mean" functionality is in principle a good idea, its current implementation leads the user to think that fuzzy results are being accounted for. Moreover, the suggestions bear less lexical similarity to the query term than the relevant results do, which is especially confusing.

[Screenshot: 2016-07-28 at 11:15:41 am]

kltm commented 8 years ago

@harryhoch Typically that would be the spellchecker component in Solr. I would suggest building something like "closest" into the API. Low or zero returns would trigger a secondary search against the spellcheck component and offer suggestions.
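The control flow described here could look roughly like the sketch below. It is an illustration only: `search_with_fallback`, `primary_search`, `suggest`, and `min_hits` are hypothetical names, and the two callables stand in for whatever the API would use to query the index and the spellcheck component.

```python
def search_with_fallback(query, primary_search, suggest, min_hits=1):
    """Run the normal search; on low or zero returns, retry with suggestions.

    primary_search(query) -> list of hits   (assumed interface)
    suggest(query)        -> list of candidate terms, best first (assumed)
    """
    hits = primary_search(query)
    if len(hits) >= min_hits:
        return {"query": query, "hits": hits, "rerouted_from": None}

    # Secondary search: try each suggestion until one returns results.
    for candidate in suggest(query):
        alt_hits = primary_search(candidate)
        if alt_hits:
            # Keep the original query so the UI can show a warning and
            # an escape hatch back to the literal search.
            return {"query": candidate, "hits": alt_hits, "rerouted_from": query}

    # Nothing better was found: report the original (possibly empty) result.
    return {"query": query, "hits": hits, "rerouted_from": None}
```

Returning `rerouted_from` alongside the hits keeps the rerouting decision visible to the client, which is what the warning in the issue title would need.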

jmcmurry commented 8 years ago

@kltm can you think of any scenario where delimiters/punctuation (other than space and colon) should be given special treatment? Seems like if we just normalized both the query and the index to be punctuation-free we might be a lot better off?
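As a rough sketch of what that normalization might look like if applied identically to the query and to the indexed labels: the function name `normalize` is hypothetical, and replacing punctuation with a space (rather than deleting it) is an assumption that affects matching, since "Loeys-Dietz" would then match "Loeys Dietz" but not "LoeysDietz".

```python
import re

# Anything that is not a word character, whitespace, or a colon, plus the
# underscore (which \w would otherwise keep).
_PUNCT = re.compile(r"[^\w\s:]|_")

def normalize(text):
    """Lowercase, replace punctuation (except colons) with spaces, and
    collapse runs of whitespace. Intended for matching only; the canonical
    label is what would still be displayed to users.
    """
    stripped = _PUNCT.sub(" ", text.lower())
    return " ".join(stripped.split())
```

Under these assumptions, `normalize("Loeys-Dietz syndrome")` and `normalize("loeys dietz  syndrome")` produce the same string, while an identifier-shaped query such as `HP:0001166` keeps its colon.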

kltm commented 8 years ago

Well, dashes, maybe--distance is not always great, and people who know what they're looking for usually want to nail it right away.

In the end, you may be heading towards a heavier customization of the Solr tokenizers, parsers, etc. than the typical out-of-the-box experience. It would probably be good to map tickets more directly onto features and figure out the best way to bring them into the APIs.

harryhoch commented 8 years ago

@kltm - good points.

How do we configure Solr to search for mouse strain names? :-)

jmcmurry commented 8 years ago

Sorry, just to clarify my suggestion: I'm not talking about removing the canonical punctuation from what is displayed to users. (That would be a bit bonkers, especially for genes with legitimate hyphenation, periods, etc.) I'm just suggesting that the ranking algorithm on the back end not try to ascribe importance to it.

kltm commented 8 years ago

@jmcmurry Correct--my comment is taking that into account.