oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

Lucene tokenization problem with search strings #934

Open bond- opened 9 years ago

bond- commented 9 years ago

For example, I ran the following command:

java -cp ./opengrok.jar org.opensolaris.opengrok.search.Search -R /var/opengrok/etc/configuration.xml -f 42826bf03710200044e0bfc8bcbe5d72 -p *test* -t java

and the query output I got is:

Your search "+full:42826 +full:bf03710200044e0bfc8bcbe5d72 +path:*test* +type:java"

If I change this to include quotes, the result is as follows. Command:

java -cp ./opengrok.jar org.opensolaris.opengrok.search.Search -R /var/opengrok/etc/configuration.xml -f \"42826bf03710200044e0bfc8bcbe5d72\" -p *test* -t java

Query output:

Your search "+full:"42826 bf03710200044e0bfc8bcbe5d72" +path:*test* +type:java"

I expected the hash value 42826bf03710200044e0bfc8bcbe5d72 to be passed to the query as a single string.

tarzanek commented 9 years ago

Sounds like a bug. Just a sanity test: if you search for the same hash from the web UI, does it work?

bond- commented 9 years ago

Yes, it's the same on the web UI as well. Please find the results below.

If I use

42826bf03710200044e0bfc8bcbe5d72

in the full text search column, the generated query string is:

+full:42826 +full:bf03710200044e0bfc8bcbe5d72


If the search string is

"42826bf03710200044e0bfc8bcbe5d72"

(quotes included), the query string is:

+full:"42826 bf03710200044e0bfc8bcbe5d72"


vladak commented 9 years ago

This is caused by the first non-numeric character in a query that starts with a sequence of digits; in the example above it's the letter b.

If I enter the query 11111111111111111111111111111111 and replace one of the 1 characters with a letter, say f, then the query is always split just before the letter, no matter where the letter is placed in the string.

For example:
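
A minimal Java sketch of the splitting rule as described in this comment, assuming the rule is simply "a term that starts with digits is cut just before its first non-digit character". It only imitates the observed output; it is not OpenGrok's or Lucene's tokenizer, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not OpenGrok or Lucene code. */
final class LeadingDigitSplit {

    /**
     * Splits a term that starts with digits just before its first
     * non-digit character. A term that does not start with a digit,
     * or that consists of digits only, is returned unchanged.
     */
    static List<String> split(String term) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        // Advance past the leading run of digits.
        while (i < term.length() && Character.isDigit(term.charAt(i))) {
            i++;
        }
        if (i == 0 || i == term.length()) {
            parts.add(term);                 // nothing to split
        } else {
            parts.add(term.substring(0, i)); // leading digits
            parts.add(term.substring(i));    // rest, starting at the letter
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [42826, bf03710200044e0bfc8bcbe5d72]
        System.out.println(split("42826bf03710200044e0bfc8bcbe5d72"));
        // prints [1111, f111]
        System.out.println(split("1111f111"));
        // prints [11111111]
        System.out.println(split("11111111"));
    }
}
```

The first call reproduces the two terms seen in the generated query above, +full:42826 and +full:bf03710200044e0bfc8bcbe5d72.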

tarzanek commented 9 years ago

Definitely not obvious behaviour. If our customizations caused this, we need to fix it; if this is the Lucene default, we should either report it or adjust it to be more natural (e.g. break the query only on whitespace). @kahatlen any clues?

vladak commented 3 years ago

We don't ship the command line interface anymore; however, the problem is still valid.