tballison / lucene-addons

Standalone versions of LUCENE_5205 and other patches: SpanQueryParser, Concordance and Co-occurrence stats
Apache License 2.0

"(2d,3d)-print" fails to parse. #5

Closed modassar81 closed 8 years ago

modassar81 commented 8 years ago

Hi Tim,

"(2d,3d)-print" fails to parse with the following exception. Please look into it.

org.apache.lucene.queryparser.classic.ParseException: Can't process field, boolean operators or a match all docs query in a pure span.
at org.apache.lucene.queryparser.spans.AbstractSpanQueryParser._parsePureSpanClause(AbstractSpanQueryParser.java:88)
at org.apache.lucene.queryparser.spans.SpanQueryParser.parseRecursively(SpanQueryParser.java:287)
at org.apache.lucene.queryparser.spans.SpanQueryParser._parse(SpanQueryParser.java:234)
at org.apache.lucene.queryparser.spans.SpanQueryParser.parse(SpanQueryParser.java:222)

Regards, Modassar

modassar81 commented 8 years ago

The following simpler query also fails to parse: "(3d)-print"

tballison commented 8 years ago

Y, the issue is that the '-' is treated as a Boolean operator. If you are searching for the literal token "-print", surround it with single quotes. If you are searching for "3d" in documents that do not contain "print", then that's a Boolean operator and the SpanOnlyParser is intended to fail.

I can't remember off the top of my head if the classic queryparser requires a space before the "-" for it to be interpreted as a Boolean operator. If it does, then I'll change the behavior of the SpanQueryParser and SpanOnlyParser. If not, then you'll have to use single quotes. Should have a chance to look into this later today.

tballison commented 8 years ago

Y, that's the behavior of the classic query parser. Even if we disagree with it, I want to keep the behavior of the SpanQueryParser(s) as close as possible to the classic query parser.

public void testNot() throws Exception {
    QueryParser p = new QueryParser("f", analyzer);
    Query q = p.parse("(3d)-print");
    assertTrue(q instanceof BooleanQuery);
    BooleanQuery bq = (BooleanQuery)q;
    assertEquals(BooleanClause.Occur.MUST_NOT, bq.clauses().get(1).getOccur());
}

tballison commented 8 years ago

Doh, that's with the SpanQueryParser, not the SpanOnlyParser...

I'm not able to replicate this problem with the SpanQueryParser in lucene5.4on-0.1:

SpanQueryParser yields: f1:3d -f1:print

Which branch are you using? Did you rename SpanOnlyParser?

modassar81 commented 8 years ago

> I'm not able to replicate this problem with the SpanQueryParser in lucene5.4on-0.1:

The issue is with the phrase query "(3d)-print". Just for reference:

Query q = p.parse("\"(3d)-print\"");

That is with SpanQueryParser only; I am not using SpanOnlyParser. I am using lucene5.4on-0.1 and can reproduce it consistently. The sample code follows:

SpanQueryParser parser = new SpanQueryParser(field, analyzer, multiTermAnalyzer);
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("\"(3d)-print\"");

Please correct me if I am wrong, but a hyphen inside double quotes should be treated as a hyphen and not as a minus. The issue occurs when the hyphen directly follows a closing bracket, as in "(3d)-print".

modassar81 commented 8 years ago

Hi Tim,

Kindly look into the issue.

Thanks, Modassar

tballison commented 8 years ago

Y, that is expected behavior with the SpanQueryParser. The double-quotes mean that you are in a proximity query which forces parsing into a SpanQuery.

If you are searching for a single term '(3d)-print', use single quotes. If you are searching for '2d' or '3d' near '-print', use single quotes around '-print'
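The quoting rules described in this comment can be sketched as a toy model. This is only an illustration and not the SpanQueryParser code: the class `ToyQuoteRules` and its methods are hypothetical, and the rule that a '-' directly after a closing parenthesis starts a new piece is an assumption taken from the behavior reported in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the SpanQueryParser implementation.
class ToyQuoteRules {

    /** Describes how this toy model treats a whole query string. */
    static String classify(String query) {
        if (isWrapped(query, '\'')) {
            // Single quotes: the content is one literal string handed to the analyzer.
            return "LITERAL:" + query.substring(1, query.length() - 1);
        }
        if (isWrapped(query, '"')) {
            // Double quotes: a proximity query whose content is parsed again.
            List<String> pieces = splitPieces(query.substring(1, query.length() - 1));
            for (String piece : pieces) {
                // Assumption: a piece with a leading '-' reads as a Boolean NOT,
                // which is not allowed inside a pure span.
                if (piece.length() > 1 && piece.startsWith("-")) {
                    return "ERROR:" + piece;
                }
            }
            return "SPAN:" + pieces;
        }
        return "OTHER:" + query;
    }

    static boolean isWrapped(String s, char quote) {
        return s.length() >= 2 && s.charAt(0) == quote && s.charAt(s.length() - 1) == quote;
    }

    /** Splits on whitespace and parentheses, dropping the parentheses themselves. */
    static List<String> splitPieces(String s) {
        List<String> pieces = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '(' || c == ')' || Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    pieces.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            pieces.add(current.toString());
        }
        return pieces;
    }
}
```

In this sketch, `ToyQuoteRules.classify("\"(3d)-print\"")` returns `ERROR:-print`, while `ToyQuoteRules.classify("'(3d)-print'")` returns `LITERAL:(3d)-print`, which mirrors the difference between the two quoting styles discussed here.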

modassar81 commented 8 years ago

If I use '(3d)-print' or (3d)'-print', I get the following exception:

org.apache.lucene.queryparser.classic.ParseException: Didn't find matching: '
at org.apache.lucene.queryparser.spans.SpanQueryLexer.readToMatchingEndToken(SpanQueryLexer.java:646)
at org.apache.lucene.queryparser.spans.SpanQueryLexer.nextToken(SpanQueryLexer.java:160)
at org.apache.lucene.queryparser.spans.SpanQueryLexer.getTokens(SpanQueryLexer.java:102)
at org.apache.lucene.queryparser.spans.SpanQueryParser._parse(SpanQueryParser.java:232)
at org.apache.lucene.queryparser.spans.SpanQueryParser.parse(SpanQueryParser.java:222)

Each of the following queries is transformed into two tokens (1. 3d, 2. '-print'):

"(3d)'-print'"
"'(3d)-print'"
"(3d)'-'print"

I want to match (3d)-print exactly. Please help me understand if I am missing something.

tballison commented 8 years ago

Looked into this a bit more...Should always run unit tests before I respond...sorry.

public void testDebug() throws Exception {
    //example to show escaping
    compareHits("'(d2)-print'", 14);
}

I'm not able to reproduce that exception with '(d2)-print' as in the above.

The parser is applying the Analyzer to whatever is between the single quotes --'(d2)-print'. So, if you use a WhitespaceTokenizer, you'll get what you want...I think:

org.apache.lucene.search.TermQuery : f1:(d2)-print

If you use the SimpleTokenizer, this will tokenize the string into 'd' and 'print', and then, following the general rules, this will be refashioned into a PhraseQuery:

org.apache.lucene.search.PhraseQuery : f1:"d print"

This is exactly the same behavior as the classic QueryParser if you surround the token -- (3d)-print -- with double quotes rather than single quotes.
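To make the effect of the tokenizer concrete, here is a minimal stand-alone sketch. It does not use Lucene: `ToyAnalysis` and its methods are hypothetical stand-ins for a whitespace tokenizer and a letters-only, lowercasing tokenizer, and `render` only imitates the query strings shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-ins for analysis; these are NOT Lucene classes.
class ToyAnalysis {

    /** Splits on runs of whitespace only, so punctuation stays inside the token. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Keeps only runs of letters and lowercases them; digits and punctuation are separators. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** One token renders like a term query, several tokens like a phrase query. */
    static String render(String field, List<String> tokens) {
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        return field + ":\"" + String.join(" ", tokens) + "\"";
    }
}
```

With the string between the single quotes, `ToyAnalysis.render("f1", ToyAnalysis.whitespaceTokens("(d2)-print"))` gives `f1:(d2)-print`, and `ToyAnalysis.render("f1", ToyAnalysis.letterTokens("(d2)-print"))` gives `f1:"d print"`.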

modassar81 commented 8 years ago

I am using WhitespaceTokenizer, but that also breaks "'(3d)-print'" into two tokens. My analyzer chain is as follows:

Tokenizer source = new WhitespaceTokenizer();
TokenStream stopTStrem = new StopFilter(source, new CharArraySet(DEFAULT_STOP_SET, true));
TokenStream icuFTStream = new ICUFoldingFilter(stopTStrem);
TokenStream pattern1FStream = new PatternReplaceFilter(icuFTStream, pattern1, "", true);
TokenStream krfStream = new KeywordRepeatFilter(pattern1FStream);
TokenStream kStream = new KStemFilter(krfStream);
TokenStream rdtStream = new RemoveDuplicatesTokenFilter(kStream);

I am not able to achieve this exact match for the query.

tballison commented 8 years ago

Doesn't that mean that at indexing time, that token will be broken into two tokens?

Can you confirm that (3d)-print is actually being indexed as a single token?

Is one of the filters breaking that into two tokens?

Can you experiment with removing some of those filters and seeing if you're getting the same behavior?

Can you confirm that you're getting the same behavior with the classic QueryParser and double quotes?

modassar81 commented 8 years ago

Yes you are right. I think it will be broken into tokens on hyphen. I will confirm and let you know the exact tokens. I missed this part while debugging the issue. Thanks for the pointer. I will confirm the behavior and verify the classic query parser with phrase. Thanks again.

modassar81 commented 8 years ago

> Doesn't that mean that at indexing time, that token will be broken into two tokens?

The analyzer I provided above is not breaking it. I am doing some preprocessing on the query, and that is what breaks the query token. Sorry, I missed this part of my code.

> Can you confirm that (3d)-print is actually being indexed as a single token?

It is broken into different tokens and indexed that way. I have preserveOriginal enabled as well.

> Is one of the filters breaking that into two tokens?

As mentioned above, my query is broken into tokens by the preprocessing and not by the analyzer. During indexing, yes, it is broken into multiple tokens by WordDelimiterFilter.

> Can you experiment with removing some of those filters and seeing if you're getting the same behavior?

I verified the behavior as explained above. Thanks for the pointers.

> Can you confirm that you're getting the same behavior with the classic QueryParser and double quotes?

I tested with the following two methods. The analyzer chain is the same one I mentioned in my comments above.

public void withClassicParser(){
    QueryParser parser = new QueryParser("f", new myAnalyzer());
    try {
        Query query = parser.parse("\"(3d)-print\"");
        System.out.println(query.toString());
    } catch (ParseException e) {
        e.printStackTrace();
    }
}

Output: f:3d-print

public void withSpanQueryParser(){
    SpanQueryParser parser = new SpanQueryParser("f", new myAnalyzer(), new myMultiTermAnalyzer());
    try {
        Query query = parser.parse("\"(3d)-print\"");
        System.out.println(query.toString());
    } catch (ParseException e) {
        e.printStackTrace();
    }
}

Output:
org.apache.lucene.queryparser.classic.ParseException: Can't process field, boolean operators or a match all docs query in a pure span.
at org.apache.lucene.queryparser.spans.AbstractSpanQueryParser._parsePureSpanClause(AbstractSpanQueryParser.java:88)
at org.apache.lucene.queryparser.spans.SpanQueryParser.parseRecursively(SpanQueryParser.java:287)
at org.apache.lucene.queryparser.spans.SpanQueryParser._parse(SpanQueryParser.java:234)
at org.apache.lucene.queryparser.spans.SpanQueryParser.parse(SpanQueryParser.java:222)

Classic query parser returns f:3d-print whereas the SpanQueryParser throws ParseException.

A few other tests:

public void withSpanQueryParserSingleQuoted(){
    SpanQueryParser parser = new SpanQueryParser("f", new myAnalyzer(), new myMultiTermAnalyzer());
    try {
        Query query = parser.parse("\"'(3d)-print'\"");
        System.out.println(query.toString());
    } catch (ParseException e) {
        e.printStackTrace();
    }
}

Output: f:3d-print

public void withSpanQueryParserSingleQuoted1(){
    SpanQueryParser parser = new SpanQueryParser("f", new myAnalyzer(), new myMultiTermAnalyzer());
    try {
        Query query = parser.parse("'(3d)-print'");
        System.out.println(query.toString());
    } catch (ParseException e) {
        e.printStackTrace();
    }
}

Output: f:3d-print

tballison commented 8 years ago

Got it...I think.

My suggestion is that double quotes in classic QueryParser should have the same behavior as single quotes in the SpanQueryParser... And, I think, this is what you've found. In short, I think the SpanQueryParser is behaving as it should.

As you've found, double quotes in SpanQueryParser will result in an exception because of the difference in how proximity queries are built with the SpanQueryParser vs the classic QueryParser.

So, given the equivalence of these two:

public void withClassicParser(){
    QueryParser parser = new QueryParser("f", new myAnalyzer());
    try {
        Query query = parser.parse("\"(3d)-print\"");
        System.out.println(query.toString());
    } catch (ParseException e) {
        e.printStackTrace();
    }
}

Output: f:3d-print

public void withSpanQueryParserSingleQuoted1(){
    SpanQueryParser parser = new SpanQueryParser("f", new myAnalyzer(), new myMultiTermAnalyzer());
    try {
        Query query = parser.parse("'(3d)-print'");
        System.out.println(query.toString());
    } catch (ParseException e) {
        e.printStackTrace();
    }
}

Output: f:3d-print

How do you want the behavior to differ?

modassar81 commented 8 years ago

Hi Tim.

I was just wondering whether the query syntax could be the same in both examples. Thanks for your explanation.

Best, Modassar

tballison commented 8 years ago

Hi Modassar, Unfortunately, I can't think of a way that it can be the same. There is a fundamental (and unfortunately subtle) difference between the handling of proximity queries with the classic parser and the SpanQueryParser. With the SQP, there needs to be further potentially structural parsing within a proximity query...so that a query for "match on 'a' or 'b' right before 'c'" -> "(a b) c" is different from "match on a literal token '(a b) c'" -> '(a b) c'. I tried as much as possible to retain the classic syntax...here, though, I don't think it works. Can you think of any way around this?

Best, Tim
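The structural difference described in the last comment can be illustrated with a small sketch. It is a toy under stated assumptions, not the SpanQueryParser grammar: `ToyProximity.describe` is hypothetical, it handles only one level of parentheses, and the `NEAR`, `OR` and `TERM` labels are made up for display.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the SpanQueryParser grammar.
class ToyProximity {

    /** Expects a query wrapped in single or double quotes and describes its structure. */
    static String describe(String query) {
        if (query.length() < 2) {
            throw new IllegalArgumentException("Expected a quoted query");
        }
        char quote = query.charAt(0);
        String inner = query.substring(1, query.length() - 1);
        if (quote == '\'') {
            // Single quotes: everything inside is one literal token.
            return "TERM[" + inner + "]";
        }
        // Double quotes: the content is parsed again for structure.
        List<String> clauses = new ArrayList<>();
        int i = 0;
        while (i < inner.length()) {
            char c = inner.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                // A parenthesized group becomes a set of alternatives.
                int close = inner.indexOf(')', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unmatched '('");
                }
                String[] alternatives = inner.substring(i + 1, close).trim().split("\\s+");
                clauses.add("OR(" + String.join(", ", alternatives) + ")");
                i = close + 1;
            } else {
                // A bare term runs until whitespace or the next group.
                int end = i;
                while (end < inner.length()
                        && !Character.isWhitespace(inner.charAt(end))
                        && inner.charAt(end) != '(') {
                    end++;
                }
                clauses.add(inner.substring(i, end));
                i = end;
            }
        }
        return "NEAR[" + String.join(", ", clauses) + "]";
    }
}
```

Here `ToyProximity.describe("\"(a b) c\"")` returns `NEAR[OR(a, b), c]`, whereas `ToyProximity.describe("'(a b) c'")` returns `TERM[(a b) c]`. The same characters lead to two different query shapes, which is why the two quoting styles cannot share one syntax in this model.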