jurismarches / luqum

A lucene query parser generating ElasticSearch queries and more !
Other
188 stars 40 forks source link

zero_terms_query appended to match_phrase query #23

Closed ramyala closed 6 years ago

ramyala commented 6 years ago

There is a regression since 0.7.x where zero_terms_query: none is being appended to generated query for match_phrase

For query some_id: hello-world

0.6.1 generated {'match_phrase': {'some_id': {'query': hello-world'}}} 0.7.1 generated {'match_phrase': {'participant_id': {'query': 'LB-S00133', 'zero_terms_query': 'none'}}}

Note that the 0.7.1 query fails on elasticsearch 5.6.8 with: TransportError(400, 'parsing_exception', '[match_phrase] query does not support [zero_terms_query]')

alexgarel commented 6 years ago

Thanks for the report, I will try to look into this this week.

alexgarel commented 6 years ago

@ramyala 95cd9d97722a38ae3abdbd90610dd4e3f515302a should fix it.

I wait for your validation before closing.

Also I will study how to setup an integration test against elasticsearch (would be cool).

ramyala commented 6 years ago

The issue still persists based on quotes/spaces in the query:

query: some_id: "ABC-123"
luqum: {'match_phrase': {'some_id': {'query': 'ABC-123'}}}

query: some_id:ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}

query: some_id: ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}

Note, only the first query was fixed with your changes.

ramyala commented 6 years ago

One of the differences here is the newer version of luqun produces match vs. match_phrase for the query some_id:ABC-123. The zero_terms_query doesn't seem right either for match.

alexgarel commented 6 years ago

@ramyala, this is indeed a change, but it should normally give same results (for what I can anticipate). I introduced it to be more consistent (but you can now force the request type, by adding field_options={"some_id": {"type": "match_phrase"}} to ElasticsearchQueryBuilder).

Could you test the query results against your server ? IMO they will be the same with match_phrase and match for a single term.

If it is not the same, let me know, we may try to keep compatibility.

ramyala commented 6 years ago

On git master, I've noticed different results because of missing type: 'phrase' kv in the map.

master generates:

query: some_id: "ABC-123"
luqum: {'match_phrase': {'some_id': {'query': 'ABC-123'}}}
> 1 result

query: some_id:ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}
> many results

query: some_id: ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}
> many results

Whereas 0.6.1 generates:

query: some_id: "ABC-123"
luqum: {'match_phrase': {'some_id': {'query': 'ABC-123'}}}
> 1 result

query: some_id:ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'type': 'phrase', 'zero_terms_query': 'none'}}}
> 1 result

query: some_id: ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'type': 'phrase', 'zero_terms_query': 'none'}}}
> 1 result
ramyala commented 6 years ago

I think its worth deprecating 0.7.x because of the numerous regressions from 0.6.1. These are the only queries I've tried but there could be more which we are missing.

alexgarel commented 6 years ago

hello @ramyala, thanks for the clear report. It seems I've done some wrong assumptions.

So as you suggest, I will try to turn back, someway to old behaviour (that is using match_phrase) and release a 7.3, while removing 7.2 files from pypi (and signal it is deprecated in CHANGELOG).

I would be quite interested yet in digging what is the difference between a match and a match_phrase query , when query is a single word.

alexgarel commented 6 years ago

I have dug a bit and the difference comes from the fact that ABC-123, when analyzed, maybe treated as "ABC 123" (eg. by the compound word delimiter token filter) which gives different results in case of match (matching one term is sufficient) and match_phrase.

That said, if you use {query_string':'some_id: ABC-123'} the behaviour is that of match and not that of match_phrase (that is in your case many results).

If you instead use {query_string':'some_id: "ABC-123"'} the behaviour is that of match_phrase (that is in your case 1 result.

@ramyala as the goal of luqum is to keep consistent with query_string (at least by default), it looks like a fix to me (not a regression).

Although I understand it hits you badly ! My proposal then is not to fix match to match_phrase, for one word expression by default, but I propose:

would that be ok for you ?

PS: I attach a ipynb file (in markdown and pdf format) that shows what I say on a 2.3 instance (I would be interested if you can run it on a 5 or 6 instance, but I do not expect differences).

ramyala commented 6 years ago

hi @alexgarel

I have been looking at this and your analysis explains this behavior perfectly. The fix would actually have to be on our end to either change the analysis filter or use keyword tokenizer to index words with hyphens in them.

So, I'm ok with the current behavior of the fixes in master and adding a way to force the query builder to output match_phrase if need be.

alexgarel commented 6 years ago

@ramyala I made a new commit.

I wait for your eventual validation before releasing on pypi.

ramyala commented 6 years ago

I've tested the quoted / unquoted versions of the query with master. These changes look good. Thanks for the fix!

alexgarel commented 6 years ago

0.7.2 released.