pombase / canto

The PomBase community curation tool
https://curation.pombase.org
Other
19 stars 7 forks source link

Lucene parse error when looking up punctuation characters #2375

Open jseager7 opened 3 years ago

jseager7 commented 3 years ago

I've noticed that if you enter a search term consisting only of punctuation characters into the term or experimental conditions lookup, Lucene throws a parse error in the server console:

internal error: Caught exception in Canto::Controller::Service->lookup "[Lucene::QueryParser->parse()] Unrecognized char 42 at 52::1." at lib/Canto/Controller/Root.pm line 55.

I have no idea what 'char 42' is meant to be, since the code never changes regardless of which punctuation character you type. I tried looking up the ASCII code for 42, and apparently it's an asterisk.

This bug doesn't occur if you enter a single letter, number, or white-space into the lookup. If you have a symbol and some white-space, the bug still happens (probably due to white-space being trimmed). However, if you enter a punctuation symbol followed by letters or numbers (ideally by pasting the text to ensure the autocomplete doesn't have time to run), then the bug doesn't occur. So this query works fine:

'+ bleo' → '+ bleomycin'

It's possible that users could encounter this bug if they try to enter '+' in an attempt to get back all of the conditions terms that involve the addition of some chemical (this is how I encountered the bug).


This bug also seems to break jQuery UI's ability to know when the autocomplete loading is done, so the spinner never goes away, even after you do another successful lookup. You have to close the modal to get rid of it.

image

kimrutherford commented 3 years ago

Thanks for spotting that. The Lucene index lookup now cleans the input string before passing to Lucene. I'm surprised I hadn't done that already. I've commited the fixed. Could you let me know if it helps?

jseager7 commented 3 years ago

The above fix prevents errors when entering queries with only punctuation characters, but I can trigger a different error by entering a letter followed by a punctuation character, for example: 'e +', which causes the following error:

internal error: Caught exception in Canto::Controller::Service->lookup "[Lucene::Search::IndexSearcher->search()] Too Many Clauses" at lib/Canto/Controller/Root.pm line 55.

I should've checked this when I initially reported the bug, but Lucene does specify a list of special characters, so maybe it would make more sense to escape them rather than stripping them out. That way Canto might be able to return all chemical addition terms with the query '+'. The special characters are:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

(source)

I don't know if Lucene has a method to escape the query string automatically; if not, the following substitute regular expression should work:

s/([+\-&|!(){}\[\]^"~*?:\\\/])/\\$1/g
jseager7 commented 3 years ago

Note that this isn't the first time the 'Too Many Clauses' error has shown up: see https://github.com/pombase/canto/issues/1482

jseager7 commented 3 years ago

I've tested escaping the special characters, and while it has the same effect of fixing the bug, it still doesn't trigger the lookup when only punctuation characters are included. This could be a side-effect of commit d4f050e, which prevented very short queries, but from reading around, it also seems that Lucene's query parser automatically removes special characters when using the StandardAnalyzer class (at least in the Java version).

This behaviour might explain why I'm seeing the Too Many Clauses error again: if Service.pm is checking the length of the query before sending the query to Lucene, then it might not account for the special characters being removed later.

This is the code that currently checks the query length:

if (length ($search_string =~ s/^\s+//r =~ s/\s+$//r) <= 2) {
  return [];
}

So presuming the following query is submitted:

'+ a'

After removing leading and trailing white-space, the text is still 3 characters in length, so it passes the check. However, after it gets to Lucene, the special character is presumably removed (and possibly the leading space as well), meaning the final query string is this:

'a'

And that's probably enough to trigger the Too Many Clauses error.

We might be able to fix this problem by using a more intelligent query strategy (or just by increasing the clause limit), but I have no idea where to start – I don't know how to configure Lucene in Canto, or even where the configuration is specified.

jseager7 commented 3 years ago

To clarify, it seems Lucene converts many types of queries (PrefixQuery, FuzzyQuery, WildcardQuery, RangeQuery) into a BooleanQuery, and by default if that matches more than 1024 entries, you get the Too Many Clauses error. (source)

ValWood commented 2 years ago

Is this still current? (Dec 2020)

jseager7 commented 2 years ago

The bug that the issue was originally about has been fixed with a workaround (I think), but there's still an error under different conditions which I described in an earlier comment:

I can trigger a different error by entering a letter followed by a punctuation character, for example: 'e +',

See https://github.com/pombase/canto/issues/2375#issuecomment-741670632.

Maybe it's up to @kimrutherford whether or not that needs to be addressed?

ValWood commented 10 months ago

I have put at "low priority" for now, I'm trying to plan the next few months and to group related tickets across Kim's trackers. @kimrutherford can reprioritise this if it should be higher.