jurismarches / luqum

A lucene query parser generating ElasticSearch queries and more !
Other
187 stars 42 forks source link

add escaping #24

Closed aglassdarkly closed 6 years ago

aglassdarkly commented 6 years ago

The parser doesn't handle escaped characters.

See: https://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Escaping%20Special%20Characters

alexgarel commented 6 years ago

Thanks for the report, I will try to look into this soon.

aglassdarkly commented 6 years ago

We're using the luqum parser to help transform a subset of Lucene syntax into SQL. Other than the lack of escape handling, it has been really useful. Thanks!

alexgarel commented 6 years ago

@aglassdarkly it should work now on master.

Could you check that it matches your use case ?

aglassdarkly commented 6 years ago

Thanks for getting to this so quickly!

Edit: The parser is breaking when (){}[] or : is the first character of a word.

alexgarel commented 6 years ago

Pushed a fix, can you check ?

aglassdarkly commented 6 years ago

Works beautifully!

aglassdarkly commented 6 years ago

I found a small issue. Wildcard expressions can't be differentiated from a word where the wildcard character was escaped. Thus, id:\*something\* looks like id:*something* after parsing. However, they may need to be treated differently. The escaped version is just a normal string that includes a * but the unescaped version signifies a wildcard search.

alexgarel commented 6 years ago

good catch @aglassdarkly, so is it for ? I suppose.

alexgarel commented 6 years ago

@aglassdarkly maybe escape char removal that I did is a bad idea, and we should keep'em, I see that on the elasticsearch side, match does not have problem with them.

How is it on your side, is it better if we keep the '\' in the value ? (I would however method to remove them easily)

aglassdarkly commented 6 years ago

I think leaving the escape chars in and adding a method to remove them is the best approach. This would allow differentiation between a wildcard expression (unescaped wildcard char) and a normal word (escaped wildcard char), allowing me to generate appropriate SQL.

alexgarel commented 6 years ago

@aglassdarkly new commit to check 7e7a73e084f099b44226066dcdf33f5a3989aed7

I have verified that ElasticSearch (at least in 2.3) supports keeping escaped chars in match and match_phrase query. That means I don't have to modify the elasticsearch query generation part.

aglassdarkly commented 6 years ago

Thanks for the updates. I think this is a good approach.

I didn't find wildcards_iter() useful in my specific case. Here is what I ended up doing:

def sub(val):
    if val == '*':
        return '%'
    elif val == '?':
        return '_'
    else:
        return Term(val).unescaped_value
unescaped = ''.join([sub(p) for p in item.expr.WILDCARDS_PATTERN.split(item.expr.value)])

Adding a split_wildcards() method would avoid my using the regex directly and might be useful for others with similar use cases.

alexgarel commented 6 years ago

Hi @aglassdarkly, thanks, I will look into this next week.

alexgarel commented 6 years ago

Hi @aglassdarkly, is 61dd13dff6088a92380a2e63ed5105e206473a48 doing the job ?

If, so I let you close the ticket !

aglassdarkly commented 6 years ago

Looks good. Thanks working with me on this.