jurismarches / luqum

A lucene query parser generating ElasticSearch queries and more !

Keyword fields containing wildcards cannot be searched for exactly #78

Closed zeitderforschung closed 1 year ago

zeitderforschung commented 2 years ago

Thank you very much, you have created a really amazing library. 👍🏻

I have come across a special case. I have keyword fields that contain wildcard characters (* or ?). In Elasticsearch this is no problem at all. But it seems luqum has some difficulties with this use case.

Here is an example of indexing a document with a keyword field containing wildcard characters using ES.

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="http://localhost:9200")
mappings = {"properties":{"vendor":{"type":"keyword"}}}
es.indices.create(index="test", mappings=mappings)
es.index(index="test", body={"vendor": "f**k"}, id="example")

Now I want to search on that field. The following query matches the document, but it is not what I want, because it performs a wildcard search rather than an exact term search.

es.search(body={
    "query": {
        "query_string": {
            "query": "vendor:f**k"
        }
    }
}, index="test")
{'took': 2,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'test',
    '_id': 'example',
    '_score': 1.0,
    '_source': {'vendor': 'f**k'}}]}}

(1) To search for the exact value, you have to escape the wildcard characters. This works in ES.

# Raw string so the backslashes reach Elasticsearch unchanged.
es.search(body={
    "query": {
        "query_string": {
            "query": r"vendor:f\*\*k"
        }
    }
}, index="test")
{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 0.2876821,
  'hits': [{'_index': 'test',
    '_id': 'example',
    '_score': 0.2876821,
    '_source': {'vendor': 'f**k'}}]}}
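
The escaping used in (1) can be automated with a small helper. This is only a minimal sketch: `escape_wildcards` is a hypothetical name, not part of luqum or the Elasticsearch client. It handles only `*` and `?`, not the other reserved characters of the query string syntax, and assumes the input contains no backslashes of its own.

```python
import re


def escape_wildcards(value: str) -> str:
    """Backslash-escape the wildcard characters * and ? in a term."""
    # In the replacement, "\\" is a literal backslash and "\1" the matched wildcard.
    return re.sub(r"([*?])", r"\\\1", value)


query = "vendor:" + escape_wildcards("f**k")
print(query)  # vendor:f\*\*k
```

The resulting string can be passed as the `query` value of the `query_string` request shown above.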

(2) Alternatively, you can use a phrase query. This also works in ES.

# Raw string so the backslashes reach Elasticsearch unchanged.
es.search(body={
    "query": {
        "query_string": {
            "query": r'vendor:"f\*\*k"'
        }
    }
}, index="test")
{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 0.2876821,
  'hits': [{'_index': 'test',
    '_id': 'example',
    '_score': 0.2876821,
    '_source': {'vendor': 'f**k'}}]}}

Now when I try both (1) and (2) with luqum, neither works.

from luqum.elasticsearch import SchemaAnalyzer, ElasticsearchQueryBuilder
schema_analyzer = SchemaAnalyzer({"mappings": mappings})
es_builder = ElasticsearchQueryBuilder(**schema_analyzer.query_builder_options())

(1) Luqum creates a wildcard query even when the "*" characters are escaped. This behaviour differs from ES and is not what I expected. The escape characters are apparently not removed from the value either.

from luqum.parser import parser
es_builder(parser.parse(r"vendor:f\*\*k"))
{'wildcard': {'vendor': {'value': 'f\\*\\*k'}}}
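
To make the expected behaviour concrete, here is a minimal sketch of the mapping this report asks for: a term whose wildcards are all escaped should become a `term` query on the unescaped value, and only unescaped wildcards should produce a `wildcard` query. The function `expected_query` is hypothetical and is not how luqum is implemented. It also ignores the corner case of an escaped backslash directly before a wildcard.

```python
import re

# A wildcard character that is not preceded by a backslash.
_UNESCAPED_WILDCARD = re.compile(r"(?<!\\)[*?]")


def expected_query(field: str, raw_term: str) -> dict:
    """Return the query this report expects for a raw query-string term."""
    if _UNESCAPED_WILDCARD.search(raw_term):
        # A genuine wildcard: keep the pattern as written.
        return {"wildcard": {field: {"value": raw_term}}}
    # Every wildcard is escaped: strip the escapes and match exactly.
    value = re.sub(r"\\([*?])", r"\1", raw_term)
    return {"term": {field: {"value": value}}}


print(expected_query("vendor", r"f\*\*k"))
print(expected_query("vendor", "f**k"))
```

The first call yields a `term` query on `f**k`, the second a `wildcard` query, which is the distinction ES appears to make for the two query strings above.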

(2) Luqum also creates a wildcard query when the search term is entered as a phrase. This behaviour likewise differs from ES and is not what I expected.

from luqum.parser import parser
es_builder(parser.parse('vendor:"f**k"'))
{'wildcard': {'vendor': {'value': 'f**k'}}}

I don't see any way to formulate a query string so that a term containing "*" can be searched for exactly.
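
For reference, an exact match on such a value can be written without any query string by sending a `term` query in the Elasticsearch query DSL directly. This is only a sketch of a workaround that bypasses luqum entirely, not a fix for the parsing behaviour. `exact_term_query` is a hypothetical helper name.

```python
def exact_term_query(field: str, value: str) -> dict:
    """Build an Elasticsearch term query matching the keyword value exactly."""
    return {"term": {field: {"value": value}}}


body = {"query": exact_term_query("vendor", "f**k")}
print(body)

# Against the index created above (requires a running cluster):
# es.search(index="test", body=body)
```

Because no query string is parsed, the `*` characters need no escaping here.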

Regards, André