Open jmcmurry opened 8 years ago
+1. Solr should help here, as it supports fuzzy searches within a given edit distance.
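For reference, the Solr standard query parser expresses this with a trailing `~` on a term, optionally followed by a maximum edit distance of 0, 1 or 2. Below is a minimal sketch of building such a query string from user input. `fuzzy_query` and `escape_term` are hypothetical helper names, not part of our API, and how hyphenated terms end up being tokenized still depends on the field's analyzer.

```python
import re

# Characters that have special meaning in the Lucene/Solr standard query parser.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(term: str) -> str:
    """Backslash-escape query-parser special characters in a single term."""
    return _SPECIAL.sub(r"\\\1", term)


def fuzzy_query(user_input: str, max_edits: int = 1) -> str:
    """Append the fuzzy operator (~N) to every whitespace-separated term."""
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1 or 2")
    terms = [escape_term(t) for t in user_input.split()]
    return " ".join(f"{t}~{max_edits}" for t in terms)


if __name__ == "__main__":
    # A query with a missing final letter still becomes a fuzzy search.
    print(fuzzy_query("marfan syndrom"))
```

The resulting string would be passed as the `q` parameter. Whether to apply the operator to every term, or only after an exact search comes back empty, is a separate design choice.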
Actually, this scenario is even worse than I originally realized. a) Site search returns no hits for any of the hyphenated forms of the disease (e.g. related to #1282).
b) Including only the hyphen, and not Dietz, gets you no results but does get you a "did you mean" page, which I didn't even know we had :). Although the "did you mean" functionality is a good idea in principle, its current implementation leads the user to think that fuzzy results are being accounted for. Moreover, the suggestions bear less lexical similarity to the query term than the relevant results do, which is especially confusing.
@harryhoch Typically the spellchecker component in Solr. I would suggest building something like "closest" into the API. Low or zero result counts would trigger a secondary search against the spellcheck component and offer suggestions.
@kltm can you think of any scenario where delimiters/punctuation (other than space and colon) should be given special treatment? Seems like if we just normalized both the query and the index to be punctuation-free we might be a lot better off?
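To make the idea concrete, here is a minimal sketch of applying the same normalization on both sides, assuming we lowercase, keep letters, digits and colons, and turn every other character into a space. `normalize`, `build_index` and `lookup` are hypothetical names, the label is only an illustration, and the canonical label is kept untouched for display.

```python
def normalize(text: str) -> str:
    """Lowercase; keep alphanumerics and colons; map everything else to a space."""
    chars = [ch if (ch.isalnum() or ch == ":") else " " for ch in text.lower()]
    # Collapse the runs of spaces left behind by removed punctuation.
    return " ".join("".join(chars).split())


def build_index(labels):
    """Map each normalized key to its canonical, still punctuated, label."""
    return {normalize(label): label for label in labels}


def lookup(index, query):
    """Return the canonical label whose normalized form equals the query's."""
    return index.get(normalize(query))


if __name__ == "__main__":
    index = build_index(["Charcot-Marie-Tooth disease"])
    # Hyphenated, spaced and oddly cased queries all hit the same entry.
    for q in ("charcot marie tooth disease", "Charcot-Marie-Tooth  Disease"):
        print(q, "->", lookup(index, q))
```

Mapping punctuation to a space rather than deleting it is itself an assumption: it makes "a-b" match "a b" but not "ab".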
Well, dashes, maybe--distance is not always great, and people who know what they're looking for usually want to nail it right away.
In the end, you may be heading towards a heavier customization of the Solr tokenizers, parsers, etc. than the typical out-of-the-box experience. It would probably be good to map tickets more directly onto features and figure out the best way to bring them into the APIs.
@kltm - good points.
How do we configure Solr to search for mouse strain names? :-)
Sorry, just to clarify my suggestion, I'm not talking about removing the canonical punctuation from what is displayed to users. (That would be a bit bonkers especially in genes with legit hyphenation and periods etc). I'm just suggesting that the ranking algorithm on the back end not be trying to ascribe importance to it.
@jmcmurry Correct--my comment is taking that into account.
A big take-home from the recent UX efforts has been that people often don't wait for autocomplete. Moreover, autocomplete is not well suited to hunt-and-peck typists. Thus (in addition to improving performance) we also need to improve site search results.
Lots of diseases, phenotypes, and genes have peculiar or difficult spellings/punctuation. When there are no exact matches at the level of the word, we currently show "No results", even when loads of content is a mere character away.
At the back end, we thus need fuzzy matching (not just stemming and wildcard). When things are lexically very similar, we should follow the giants and default to the closest matching term with a warning.
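The intended behavior could be sketched roughly as follows, assuming a plain Levenshtein edit distance over an in-memory list of terms. In practice this would more likely be delegated to the search engine. `edit_distance`, `search` and the `max_edits` threshold are hypothetical and only illustrate the "closest match plus warning" fallback.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def search(query, terms, max_edits=2):
    """Return (results, warning); fall back to the closest term if no exact hit."""
    q = query.lower()
    exact = [t for t in terms if t.lower() == q]
    if exact:
        return exact, None
    scored = sorted((edit_distance(q, t.lower()), t) for t in terms)
    if scored and scored[0][0] <= max_edits:
        best = scored[0][1]
        return [best], f'No exact match for "{query}"; showing results for "{best}".'
    return [], None


if __name__ == "__main__":
    terms = ["Marfan syndrome", "Noonan syndrome"]
    print(search("Marfan syndrom", terms))
```

Returning the warning alongside the results lets the UI say that a substitution happened, rather than silently showing fuzzy hits.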