jurismarches / luqum

A lucene query parser generating ElasticSearch queries and more !
Other
188 stars 40 forks source link

Fix precedence for unknown operation and boost #89

Closed JSCU-CNI closed 1 year ago

JSCU-CNI commented 1 year ago

The precedence for the unknown (implied) operation is always lower than for AND and OR operations, as evidenced by how the current Apache Lucene library handles this. A simple TreeVisitor on a PrecedenceQueryParser shows the following:

Default operator: OR
Entered query:    1 2 AND 3 4
Parsed query:     1 (+2 +3) 4
--------
BooleanQuery(MUST): 1 (+2 +3) 4
  BooleanQuery(SHOULD): 1 (+2 +3) 4
    TermQuery: 4
    TermQuery: 1
    BooleanQuery(MUST): +2 +3
      TermQuery: 2
      TermQuery: 3

Default operator: OR
Entered query:    1 2 OR 3 4
Parsed query:     1 (2 3) 4
--------
BooleanQuery(MUST): 1 (2 3) 4
  BooleanQuery(SHOULD): 1 (2 3) 4
    TermQuery: 4
    TermQuery: 1
    BooleanQuery(MUST): 2 3
      BooleanQuery(SHOULD): 2 3
        TermQuery: 3
        TermQuery: 2

Default operator: OR
Entered query:    1 2 OR 3 4 AND 5 OR 6 7 8
Parsed query:     1 (2 3) ((+4 +5) 6) 7 8
--------
BooleanQuery(MUST): 1 (2 3) ((+4 +5) 6) 7 8
  BooleanQuery(SHOULD): 1 (2 3) ((+4 +5) 6) 7 8
    TermQuery: 8
    TermQuery: 1
    TermQuery: 7
    BooleanQuery(MUST): 2 3
      BooleanQuery(SHOULD): 2 3
        TermQuery: 3
        TermQuery: 2
    BooleanQuery(MUST): (+4 +5) 6
      BooleanQuery(SHOULD): (+4 +5) 6
        TermQuery: 6
        BooleanQuery(MUST): +4 +5
          TermQuery: 4
          TermQuery: 5

The results are similar when the default operator is changed to AND.

This commit adds a fictitious token for the invisible implied operation, giving it the lowest precedence. This should ensure that a b AND c d correctly translates to a (b AND c) d, contrasting the current (a (b AND (c d)).

To make this work, the list of precedence items had to shrink quite a bit, as there were several tokens in there that do not require disambiguation, and did cause issues with the fact that there's no lookahead token for the implied operation.

In the cleanup of this list, we also noticed that precedence of the boost operator was configured incorrectly, so that was fixed as well. This ensures that +2^1 becomes +(2^1).

alexgarel commented 1 year ago

@mmoriniere hum, sorry I merged a bit rapidly, I might have asked you first.

But I think it's ok as all tests passes.

Maybe it could be worth a minor release.

mmoriniere commented 1 year ago

@alexgarel It's OK, and I made a minor release.

https://pypi.org/project/luqum/0.12.1/