Closed ramyala closed 6 years ago
Thanks for the report, I will try to look into this this week.
@ramyala 95cd9d97722a38ae3abdbd90610dd4e3f515302a should fix it.
I wait for your validation before closing.
Also I will study how to setup an integration test against elasticsearch (would be cool).
The issue still persists based on quotes/spaces in the query:
query: some_id: "ABC-123"
luqum: {'match_phrase': {'some_id': {'query': 'ABC-123'}}}
query: some_id:ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}
query: some_id: ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}
Note, only the first query was fixed with your changes.
One of the differences here is the newer version of luqun produces match vs. match_phrase for the query some_id:ABC-123
. The zero_terms_query doesn't seem right either for match.
@ramyala, this is indeed a change, but it should normally give same results (for what I can anticipate). I introduced it to be more consistent (but you can now force the request type, by adding field_options={"some_id": {"type": "match_phrase"}}
to ElasticsearchQueryBuilder).
Could you test the query results against your server ? IMO they will be the same with match_phrase and match for a single term.
If it is not the same, let me know, we may try to keep compatibility.
On git master, I've noticed different results because of missing type: 'phrase'
kv in the map.
master generates:
query: some_id: "ABC-123"
luqum: {'match_phrase': {'some_id': {'query': 'ABC-123'}}}
> 1 result
query: some_id:ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}
> many results
query: some_id: ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'zero_terms_query': 'none'}}}
> many results
Whereas 0.6.1 generates:
query: some_id: "ABC-123"
luqum: {'match_phrase': {'some_id': {'query': 'ABC-123'}}}
> 1 result
query: some_id:ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'type': 'phrase', 'zero_terms_query': 'none'}}}
> 1 result
query: some_id: ABC-123
luqum: {'match': {'some_id': {'query': 'ABC-123', 'type': 'phrase', 'zero_terms_query': 'none'}}}
> 1 result
I think its worth deprecating 0.7.x because of the numerous regressions from 0.6.1. These are the only queries I've tried but there could be more which we are missing.
hello @ramyala, thanks for the clear report. It seems I've done some wrong assumptions.
So as you suggest, I will try to turn back, someway to old behaviour (that is using match_phrase) and release a 7.3, while removing 7.2 files from pypi (and signal it is deprecated in CHANGELOG).
I would be quite interested yet in digging what is the difference between a match and a match_phrase query , when query is a single word.
I have dug a bit and the difference comes from the fact that ABC-123, when analyzed, maybe treated as "ABC 123" (eg. by the compound word delimiter token filter) which gives different results in case of match
(matching one term is sufficient) and match_phrase
.
That said, if you use {query_string':'some_id: ABC-123'}
the behaviour is that of match and not that of match_phrase (that is in your case many results).
If you instead use {query_string':'some_id: "ABC-123"'}
the behaviour is that of match_phrase (that is in your case 1 result.
@ramyala as the goal of luqum is to keep consistent with query_string
(at least by default), it looks like a fix to me (not a regression).
Although I understand it hits you badly ! My proposal then is not to fix match
to match_phrase
, for one word expression by default, but I propose:
would that be ok for you ?
PS: I attach a ipynb file (in markdown and pdf format) that shows what I say on a 2.3 instance (I would be interested if you can run it on a 5 or 6 instance, but I do not expect differences).
hi @alexgarel
I have been looking at this and your analysis explains this behavior perfectly. The fix would actually have to be on our end to either change the analysis filter or use keyword
tokenizer to index words with hyphens in them.
So, I'm ok with the current behavior of the fixes in master and adding a way to force the query builder to output match_phrase if need be.
@ramyala I made a new commit.
I wait for your eventual validation before releasing on pypi.
I've tested the quoted / unquoted versions of the query with master. These changes look good. Thanks for the fix!
0.7.2 released.
There is a regression since 0.7.x where
zero_terms_query: none
is being appended to generated query formatch_phrase
For query
some_id: hello-world
0.6.1
generated{'match_phrase': {'some_id': {'query': hello-world'}}}
0.7.1
generated{'match_phrase': {'participant_id': {'query': 'LB-S00133', 'zero_terms_query': 'none'}}}
Note that the 0.7.1 query fails on elasticsearch
5.6.8
with:TransportError(400, 'parsing_exception', '[match_phrase] query does not support [zero_terms_query]')