pombase / canto

The PomBase community curation tool
https://curation.pombase.org

lucene search issues #218

Closed pombase-admin closed 9 years ago

pombase-admin commented 12 years ago

Hi Kim,

When searching for 'lysis', 'cytolysis' pops up but 'cell lysis' does not. Not a huge problem in itself, as cytolysis is a synonym for cell lysis (which is the 'primary name'). It just looked a bit odd that the synonym shows rather than the primary name even though they both share this part of the name, so I thought I'd tell you.

I added it to this page: https://sourceforge.net/apps/trac/pombase/wiki/Lucene_issues

Original comment by: Antonialock

pombase-admin commented 12 years ago

Original comment by: ValWood

pombase-admin commented 12 years ago

Original comment by: kimrutherford

pombase-admin commented 12 years ago

Getting "spore germination abolished, small spores" instead of "spore germination abolished" is, unfortunately, just the way Lucene works. You're getting the first, longer one because the word "spore" appears twice. Lucene treats that as a better match because it found your search term twice.

I'll try to tweak things but it might be hard to get around.

I'll have a think about the best way to show more terms.

Original comment by: kimrutherford
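
The behaviour described above can be illustrated with a simplified sketch of classic TF-IDF style scoring, where the term-frequency factor grows with the square root of the number of occurrences and a length factor favours shorter strings. This is not Canto's code or Lucene's full formula: IDF, boosts and norm encoding are left out, and `tokenize` with its crude plural stripping is a hypothetical stand-in for whatever analyzer is actually configured.

```python
import math
import re


def tokenize(text):
    # Hypothetical analyzer: lowercase, split on non-alphanumerics, and strip
    # a trailing "s" as a crude stand-in for stemming (an assumption, so that
    # "spores" counts as a second occurrence of "spore").
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def classic_score(query_term, text):
    tokens = tokenize(text)
    freq = tokens.count(query_term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)                         # more occurrences -> higher score
    length_norm = 1.0 / math.sqrt(len(tokens))   # shorter strings -> higher score
    return tf * length_norm


long_name = "spore germination abolished, small spores"
short_name = "spore germination abolished"
print(classic_score("spore", long_name))   # two occurrences in five tokens
print(classic_score("spore", short_name))  # one occurrence in three tokens
```

Under these assumptions the longer name scores higher, because the second occurrence of "spore" outweighs the penalty for the two extra words.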

pombase-admin commented 12 years ago

>Lucene treats that as a better match because it found your search term twice.

It would be great if this could be overridden or weighted very low. I'm not sure that just because a word appears twice in a string it is more likely to be the term you need (this is just fortuitous if you happened to be looking for "spore germination abolished, small spores").

I usually feel that we should be matching the shortest "exact match" but this often doesn't appear to happen. In the case of spores the user probably wants to see as many terms containing "spores" as possible...

Original comment by: ValWood

pombase-admin commented 12 years ago

Part of the problem is that Lucene is tuned for large documents rather than for autocompleting. When searching a bunch of large documents it makes sense that if your keyword appears more times in a document it's more likely to be the one of interest. That's not ideal for autocompleting though.

I'll see what I can do to encourage Lucene to do the right thing here.

Original comment by: kimrutherford
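
One possible way to make scoring friendlier to autocompletion, sketched here purely as an illustration and not as the change that was actually made, is to count each query word at most once and let only the length of the candidate string decide between matches. The function names and the prefix-matching rule are assumptions.

```python
import math
import re


def tokenize(text):
    # Hypothetical analyzer: lowercase and split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def autocomplete_score(query, name):
    query_words = set(tokenize(query))
    name_words = tokenize(name)
    if not query_words or not name_words:
        return 0.0
    # Each query word counts once, however often it occurs in the name.
    # A query word matches if it is a prefix of any word in the name.
    matched = sum(
        1 for q in query_words if any(n.startswith(q) for n in name_words)
    )
    if matched < len(query_words):
        return 0.0
    # Shorter names rank higher among the candidates that match.
    return matched / math.sqrt(len(name_words))


def rank(query, names):
    scored = [(autocomplete_score(query, n), n) for n in names]
    hits = [(s, n) for s, n in scored if s > 0]
    return [n for s, n in sorted(hits, key=lambda pair: pair[0], reverse=True)]


names = [
    "spore germination abolished, small spores",
    "spore germination abolished",
    "cell lysis",
]
print(rank("spore", names))
```

With repeated words no longer rewarded, the shorter "spore germination abolished" sorts ahead of the longer name in this sketch.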

pombase-admin commented 12 years ago

Typing "premature" into the phenotype search gives only 5 options, and none of them is "premature activation of bipolar growth".

"premature activation" doesn't even find it (although, bizarrely, it finds "abnormal activation of bipolar growth" and "abolished activation of bipolar growth", which are much less similar).

Not until I get to "premature activation of bi" do I find it.

Original comment by: ValWood

pombase-admin commented 12 years ago

Original comment by: kimrutherford

pombase-admin commented 12 years ago

The new strategy works better. It fixes most of the issues on the Lucene_issues page. It doesn't fix G2/M, which I think is a bug I can fix.

It also doesn't fix the "nucleate" search issue because Lucene doesn't index sub-strings. That can't be easily fixed without adding explicit synonyms like "mono-nucleate", "bi-nucleate" and "multi-nucleate".

I'll do some more tweaking then I'll release it to the test tool.

Original comment by: kimrutherford
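
The sub-string limitation can be shown with a toy inverted index that, like a word-based search index, stores whole tokens only. This is a sketch and not Canto's indexing code: the term names are hypothetical samples, and the tokenizer is assumed to split on hyphens.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Hypothetical analyzer: lowercase and split on non-alphanumerics,
    # so "bi-nucleate" becomes the two tokens "bi" and "nucleate".
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(terms):
    # terms maps a term name to its list of synonyms.
    index = defaultdict(set)
    for name, synonyms in terms.items():
        for label in [name, *synonyms]:
            for token in tokenize(label):
                index[token].add(name)
    return index


def search(index, word):
    # Whole-token lookup only: no sub-string matching.
    return sorted(index.get(word.lower(), set()))


plain = build_index({"binucleate": [], "multinucleate": []})
print(search(plain, "nucleate"))      # "nucleate" is not a token on its own

with_synonyms = build_index({
    "binucleate": ["bi-nucleate"],
    "multinucleate": ["multi-nucleate"],
})
print(search(with_synonyms, "nucleate"))
```

Adding the hyphenated synonyms puts a separate "nucleate" token into the index, which is why explicit synonyms work around the problem.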

pombase-admin commented 12 years ago

The test tool has now been updated with some tweaks, including a fix for the "G2/M" problem.

Exact matches to the term names or synonyms now appear mostly as the first hit and always (in the ones I've looked at) in the top 10. Please let me know of any cases you see where that doesn't happen.

I'm going through all of the notes on this ticket.

- searching for "spore" still doesn't return "spore germination abolished" in the top 10. That's because the other results have "spore" twice in the string, which Lucene likes. Or they are short, which Lucene likes. It also prefers matches that contain words that aren't common in the names and synonyms: "spore germination abolished" occurs in a lot of term names so it thinks that it's not an interesting thing to return. Not good in our case though. I'll see what can be done to persuade it otherwise.

- 'typing "premature" into phenotype search gives only 5 options and none of them is "premature activation of bipolar growth"' - this is because the synonym "premature NETO" is much shorter, so it shows that instead. It does the same in the new scheme in the test tool. I haven't done any weighting yet to encourage Lucene to find the names rather than synonyms. I'll look at that next. What I should be able to do, if Lucene returns a synonym, is check the name of the term and see if your search string also matches the term name. If so, I can swap the name into the results.

- typing "premature activation" does work now at least.

- typing "protein binding" now gives "protein binding" as the top hit - result!

Original comment by: kimrutherford
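
The preference for uncommon words mentioned in the first bullet corresponds to an inverse document frequency (IDF) weight. A minimal sketch, assuming the classic form `1 + ln(N / (df + 1))` and a small hypothetical set of term names (not the real ontology):

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical analyzer: lowercase and split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(labels):
    # df counts how many labels contain each word at least once.
    n = len(labels)
    df = Counter()
    for label in labels:
        df.update(set(tokenize(label)))
    return {word: 1.0 + math.log(n / (count + 1)) for word, count in df.items()}


# Hypothetical sample names, chosen only to make "abolished" a common word.
labels = [
    "spore germination abolished",
    "cell growth abolished",
    "septation abolished",
    "small spores",
]
idf = idf_table(labels)
print(idf["abolished"], idf["germination"])
```

In this sample, "abolished" appears in three of the four names and so gets a lower weight than "germination", which appears in only one.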

pombase-admin commented 12 years ago

That's much better. Keep thinking about "spore germination abolished", but I understand this one. I'm not sure I understand why "premature" doesn't find the "premature activation..." term, as it only finds 5 terms. Presumably if a term contains "premature" it should be in the list somewhere.

Original comment by: ValWood

pombase-admin commented 12 years ago

"premature" is finding the synonym "premature NETO" and showing that. It's a better match from Lucene's point of view because it's shorter than "premature activation of bipolar cell growth".

I'll fiddle with the Lucene results so that in those sorts of cases the name is shown instead.

Original comment by: kimrutherford
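
The name-swapping idea can be sketched as a post-processing step on each search hit. This is only an illustration of the approach described, not the code that was written: the hit structure (`name`, `matched`) and the prefix-based `matches` check are assumptions.

```python
import re


def tokenize(text):
    # Hypothetical analyzer: lowercase and split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, label):
    # True if every query word is a prefix of some word in the label.
    label_words = tokenize(label)
    return all(
        any(word.startswith(q) for word in label_words) for q in tokenize(query)
    )


def display_label(query, hit):
    # hit["matched"] is the string the search engine matched (possibly a
    # synonym); hit["name"] is the primary name of the same term.
    if hit["matched"] != hit["name"] and matches(query, hit["name"]):
        return hit["name"]
    return hit["matched"]


hit = {
    "name": "premature activation of bipolar cell growth",
    "matched": "premature NETO",
}
print(display_label("premature", hit))  # query also matches the name
print(display_label("NETO", hit))       # query matches only the synonym
```

When the search string matches only the synonym, the synonym is still shown, so a search for "NETO" keeps returning "premature NETO".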

pombase-admin commented 12 years ago

OK, I wouldn't try to improve the premature NETO/activation of bipolar growth case. That makes perfect sense, and the users will find it from this.

Original comment by: ValWood

pombase-admin commented 12 years ago

Another Lucene search issue:

If I type "flocculation" I see only: normal flocculation, absent flocculation (synonym), flocculating cell

I don't see the synonym "increased flocculation" (EXACT for flocculating cells) at all.

Original comment by: ValWood

pombase-admin commented 12 years ago

Because all of the Lucene search issues reported here are fixed, except the most recent one, I am closing this item. I will open a new, lower-priority one for the new item.

Original comment by: ValWood

pombase-admin commented 12 years ago

Original comment by: ValWood