too many results in search

mellybelly commented 6 years ago

For example, a search for EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1 returns 91491 matches.

kshefchek commented 6 years ago

As a reference here is the page on production: https://monarchinitiative.org/search/EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1

The issue here is that we're matching on the terms "syndrome", "type", and "1". The Solr relevancy score does factor in the IDF (inverse document frequency) of terms in order to decrease the weight of common terms; however, it doesn't filter these terms out of the results entirely. If anyone is interested, here are the wiki pages for the algorithms Solr uses for the relevancy score:
Current (BM25): https://en.wikipedia.org/wiki/Okapi_BM25
Classic (TF-IDF): https://en.wikipedia.org/wiki/Tf%E2%80%93idf
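To make the IDF point concrete, here is a minimal sketch of the IDF term as Lucene's BM25 similarity is documented to compute it. The index size and document frequencies below are made-up numbers for illustration, not values from the Monarch index.

```python
import math

def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """IDF component of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical index size and per-term document frequencies (assumptions).
NUM_DOCS = 1_000_000
doc_freqs = {"kyphoscoliotic": 5, "syndrome": 20_000, "type": 60_000}

idf = {term: bm25_idf(df, NUM_DOCS) for term, df in doc_freqs.items()}

for term, weight in sorted(idf.items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} idf={weight:.2f}")
```

The common terms get a much smaller weight than the rare one, but the weight stays above zero, so under an OR query a document that matches only "type" is still returned, just ranked low.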

Some quick ideas:

Add stopwords, such as "syndrome" and "disorder", so these terms are not indexed. As a test I did this on beta, but the results are still overwhelmed with matches on "type" and "1": https://beta.monarchinitiative.org/search/EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1
Extend and modify the similarity class. Solr allows for the configuration of third-party Java packages, so we could theoretically extend any similarity class and adjust the algorithm to weight IDF more heavily.
Test out other relevancy algorithms (I tried TF-IDF and the results were the same).
Set a global result limit.
Filter out documents whose score falls more than some amount X below the max relevancy score (this is discouraged, since the max relevancy score and the score distribution change for each query).
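As a rough illustration of why the stopword test did not help much, the sketch below removes the two stopwords named above from the example query. The tokenizer here is a crude stand-in (lowercase, split on non-alphanumerics), not Solr's actual analysis chain, and the function names are hypothetical.

```python
import re
from typing import List

# The two example stopwords named in the list above.
STOPWORDS = {"syndrome", "disorder"}

def rough_tokens(text: str) -> List[str]:
    """Crude stand-in for an analyzer: lowercase and split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def drop_stopwords(tokens: List[str]) -> List[str]:
    """Remove any token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in STOPWORDS]

query = "EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1"
remaining = drop_stopwords(rough_tokens(query))
print(remaining)
```

"syndrome" is gone, but "type" and "1" remain and would still match a large number of documents under an OR query.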

@kltm have you dealt with this in AmiGO?

cc @pnrobinson @DoctorBud I believe this was brought up on the last weekly call as well (for searches on Marfan Syndrome), which looks much better with the stopword filter: https://beta.monarchinitiative.org/search/marfan%20syndrome vs https://monarchinitiative.org/search/marfan%20syndrome

kltm commented 6 years ago

Looking at the results list, and keeping in mind that your Solr installation is set to a default "OR" search and uses the score to order results, the large number of results for general terms like "type" and "1" would be completely expected. It might be better to view this as a UI issue, possibly limiting results to a certain score threshold when the number of results returned is large.

That said, there could be some tweaks in there to get the 4 result to 3; some playing with field boosts could help, but Solr's preference for larger (more informative) words and higher match counts seems to be about right.
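The UI-level idea could look something like the sketch below: leave small result sets alone, and for large ones keep only documents scoring within some fraction of the top hit. The function name, the `score` key, and both default values are assumptions for illustration. As noted earlier in the thread, a cutoff like this is fragile because the score distribution changes from query to query.

```python
from typing import Dict, List

def trim_by_score(
    docs: List[Dict],
    min_fraction: float = 0.5,       # hypothetical cutoff relative to the top score
    large_result_count: int = 1000,  # hypothetical size above which trimming applies
) -> List[Dict]:
    """Keep only docs scoring at least min_fraction of the best score,
    but only when the result set is considered large."""
    if len(docs) <= large_result_count:
        return docs
    top_score = max(doc["score"] for doc in docs)
    return [doc for doc in docs if doc["score"] >= min_fraction * top_score]

# Made-up scores; large_result_count is lowered so the example triggers trimming.
sample = [{"id": i, "score": s} for i, s in enumerate([12.0, 11.0, 3.0, 2.5, 2.0])]
trimmed = trim_by_score(sample, min_fraction=0.5, large_result_count=3)
print([doc["id"] for doc in trimmed])
```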

kshefchek commented 6 years ago

bump, from @realmarcin:

I’m searching with a specific MONDO CURIE: https://monarchinitiative.org/search/MONDO%3A0000554. The top hit looks correct; however, the top left page section seems to suggest over 20k diseases? Ah, looks like it's matching ‘MONDO’ …

jmcmurry commented 6 years ago

Please move future conversation about the CURIE-search issue over to https://github.com/monarch-initiative/monarch-app/issues/1625

kshefchek commented 5 years ago

Transferring to the new app as this was never addressed.

@pnrobinson recently discovered another case when querying an HGVS variant label: https://beta.monarchinitiative.org/search/NM_144997.5(FLCN):c.1429C%3ET%20(p.Arg477Ter)

Setting debugQuery=true, I can see that the tokenizer is being aggressive with the punctuation in the HGVS format. The tokens queried are:

nm
144997.5
flcn
1429c
t
c
nm_144997.5(flcn):c.1429c>t
nm_144997.5(flcn):c.1429c>t (p.arg477ter)

This results in many false positives (from a domain perspective). Some quick ideas:

Quote all input strings, effectively overriding the tokenizer (could result in a drop in true positives)
Test out another tokenizer, e.g. the classic tokenizer instead of the standard one
Use the Solr relevancy score and adjust at the UI level, as @kltm suggested above https://github.com/monarch-initiative/monarch-ui/issues/213#issuecomment-361387739
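For the first idea, a minimal sketch of quoting user input before it goes into the `q` parameter is shown below. The helper name is hypothetical, and whether the quoted phrase then matches depends on how the searched fields are analyzed, which is not shown in this thread.

```python
from urllib.parse import urlencode

def quote_for_solr(user_input: str) -> str:
    """Wrap input in double quotes so it is parsed as a single phrase,
    escaping backslashes and embedded double quotes first."""
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

label = "NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter)"

# debugQuery=true is the parameter used above to inspect the parsed query.
params = urlencode({"q": quote_for_solr(label), "debugQuery": "true"})
print(params)
```

With the input quoted, the query parser treats the label as one phrase instead of OR-ing its fragments, which is also why some true positives may be lost.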

pnrobinson commented 5 years ago

@kshefchek @mellybelly I wonder if we can have one search for the initial autocomplete, but once the user presses go, we switch strategies and show only exact or very near matches?

kshefchek commented 5 years ago

The keyword tokenizer is the closest we have to an exact match: https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer. This is one of the three we are matching on (standard/classic tokenizer, edge ngram, keyword). We could go this route, but we may lose valid hits, keeping in mind our test cases outlined in https://github.com/monarch-initiative/monarch-app/issues/1383

I tested out the classic tokenizer, but this did not help much with the HGVS label: the match count only dropped from 297106 to 248200. The only difference is that instead of the two tokens "1429c" and "c", we get "1429c.c".

kshefchek commented 5 years ago

Fixed with https://github.com/monarch-initiative/monarch-ui/pull/214
