monarch-initiative / monarch-ui

The previous version of the Monarch Initiative website
https://previous.monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

too many results in search #213

Closed mellybelly closed 5 years ago

mellybelly commented 6 years ago

for example, EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1 has 91491 matches

[screenshot, 2018-01-24]

kshefchek commented 6 years ago

As a reference here is the page on production: https://monarchinitiative.org/search/EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1

The issue here is that we're matching on the terms "syndrome", "type", and "1". The Solr relevancy score does factor in the IDF (inverse document frequency) of terms in order to decrease the weight of common terms; however, it doesn't filter these terms out of the results entirely. If anyone is interested, here are the wiki pages for the algorithms Solr uses for the relevancy score:

Current: https://en.wikipedia.org/wiki/Okapi_BM25
TF-IDF (Classic): https://en.wikipedia.org/wiki/Tf%E2%80%93idf
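
To illustrate the IDF point above, here is a minimal sketch of the IDF component in the form used by Lucene's BM25 similarity. The document counts are hypothetical and chosen only for illustration; they are not taken from the Monarch index.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """IDF term of BM25: rarer terms get a larger weight."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical corpus size and per-term document frequencies.
NUM_DOCS = 1_000_000
doc_freqs = {"syndrome": 90_000, "type": 60_000, "kyphoscoliotic": 5}

for term, df in doc_freqs.items():
    print(f"{term:15s} idf={bm25_idf(df, NUM_DOCS):.2f}")
```

The weight of a common term is small but still positive, so with an "OR" query a document that matches only "syndrome" still counts as a hit; it just ranks low.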

Some quick ideas:

@kltm have you dealt with this in amigo?

cc @pnrobinson @DoctorBud I believe this was brought up on the last weekly call as well (for searches on Marfan Syndrome), which looks much better with the stopword filter: https://beta.monarchinitiative.org/search/marfan%20syndrome vs https://monarchinitiative.org/search/marfan%20syndrome

kltm commented 6 years ago

Looking at the results list, and keeping in mind that your Solr installation is set to a default "OR" search and using score to bring up results, the large number of results with general terms like term and 1 would be completely expected. It might be better to view this as a UI issue, possibly limiting results to a certain score threshold when the returned numbers are large.

That said, there could be some tweaks in there to get the 4 result to 3; some playing with field boosts could help, but Solr's use of preferring larger (more informative) words and match counts seems to be about right.
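
The score-threshold idea above could be sketched at the UI level roughly as follows. This is an assumption-laden sketch, not the project's implementation: `filter_by_relative_score` and the `min_ratio` cutoff are hypothetical, and it assumes each result carries a `score` field (Solr returns one when `score` is requested in `fl`).

```python
from typing import Dict, List


def filter_by_relative_score(docs: List[Dict], min_ratio: float = 0.5) -> List[Dict]:
    """Keep only results scoring at least min_ratio of the best score."""
    if not docs:
        return []
    top = max(d["score"] for d in docs)
    if top <= 0:
        # Nothing sensible to compare against; return everything.
        return list(docs)
    return [d for d in docs if d["score"] >= min_ratio * top]


# Hypothetical result list, for illustration only.
results = [
    {"id": "exact-match", "score": 10.0},
    {"id": "near-match", "score": 6.0},
    {"id": "matches-only-syndrome", "score": 1.0},
]
kept = filter_by_relative_score(results, min_ratio=0.5)
print([d["id"] for d in kept])
```

A relative cutoff is used here because raw Solr scores are not comparable across queries; the right ratio would need tuning against real searches.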

kshefchek commented 6 years ago

bump, from @realmarcin:

I’m searching with a specific MONDO curie, https://monarchinitiative.org/search/MONDO%3A0000554. The top hit looks correct; however, the top left page section seems to suggest over 20k diseases? Ah, looks like it's matching ‘MONDO’ …

jmcmurry commented 6 years ago

please move future curie-search issue conversation over to https://github.com/monarch-initiative/monarch-app/issues/1625

kshefchek commented 5 years ago

Transferring to the new app as this was never addressed.

@pnrobinson recently discovered another case when querying an HGVS variant label, https://beta.monarchinitiative.org/search/NM_144997.5(FLCN):c.1429C%3ET%20(p.Arg477Ter)

Setting debugQuery=true, I can see the tokenizer is being aggressive with the punctuation in the HGVS format. The tokens queried are:

  1. nm
  2. 144997.5
  3. flcn
  4. 1429c
  5. t
  6. c
  7. nm_144997.5(flcn):c.1429c>t
  8. nm_144997.5(flcn):c.1429c>t (p.arg477ter)

This results in many false positives (from a domain perspective). Some quick ideas:

  1. Quote all input strings, effectively overriding the tokenizer (could result in a drop in true positives)
  2. Test out another tokenizer, e.g., the classic tokenizer instead of the standard one
  3. Use the Solr relevancy score and adjust at the UI level, as @kltm suggested above https://github.com/monarch-initiative/monarch-ui/issues/213#issuecomment-361387739

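
As a rough sketch of idea 1, the input can be wrapped in double quotes so the query parser treats it as a single phrase, escaping any backslashes and quotes it already contains. The helper names are hypothetical; `debugQuery` is the Solr parameter mentioned above, and the other parameters here are only illustrative.

```python
from urllib.parse import urlencode


def quote_for_phrase_query(user_input: str) -> str:
    """Wrap the raw input in double quotes, escaping \\ and " inside it."""
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_debug_params(user_input: str) -> str:
    """URL-encode a quoted query with debugQuery enabled (hypothetical helper)."""
    return urlencode({"q": quote_for_phrase_query(user_input), "debugQuery": "true"})


label = "NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter)"
print(quote_for_phrase_query(label))
print(build_debug_params(label))
```

Whether this removes the false positives depends on how each searched field is analyzed, so the debug output would still need checking after the change.
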
pnrobinson commented 5 years ago

@kshefchek @mellybelly I wonder if we can have one search for the initial autocomplete, but once the user presses go, we switch strategies and show only exact or very near matches?

kshefchek commented 5 years ago

The keyword tokenizer is the closest we have to an exact match, https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer. This is one of three we are matching on (standard/classic tokenizer, edge ngram, keyword). We could go this route but we may lose valid hits, keeping in mind our test cases outlined in https://github.com/monarch-initiative/monarch-app/issues/1383
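
For readers unfamiliar with the difference, here is a simplified sketch of what edge n-gram and keyword analysis produce for the same input. It only approximates the behavior and is not the Solr implementation; the gram sizes are hypothetical, and the lowercasing is an assumption based on the lowercased tokens in the debug output earlier in this thread.

```python
from typing import List


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 10) -> List[str]:
    """Prefixes of the input, from min_gram up to max_gram characters."""
    token = text.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def keyword_token(text: str) -> List[str]:
    """The whole input as one token (lowercasing is an assumed extra filter)."""
    return [text.lower()]


print(edge_ngrams("marfan", min_gram=2, max_gram=4))
print(keyword_token("Marfan Syndrome"))
```

Prefix tokens suit autocomplete, while the single keyword token only matches when the entire string agrees, which is why relying on it alone risks losing valid hits.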

I tested out the classic tokenizer, but this did not help much with the HGVS label: matches dropped from 297106 to 248200. The only difference is that instead of the two tokens "1429c" and "c", we get "1429c.c".

kshefchek commented 5 years ago

fixed with https://github.com/monarch-initiative/monarch-ui/pull/214