Closed aglassdarkly closed 6 years ago
Thanks for the report, I will try to look into this soon.
We're using the luqum parser to help transform a subset of Lucene syntax into SQL. Other than the lack of escape handling, it has been really useful. Thanks!
@aglassdarkly it should work now on master.
Could you check that it matches your use case ?
Thanks for getting to this so quickly!
Edit: The parser is breaking when (){}[] or : is the first character of a word.
Pushed a fix, can you check ?
Works beautifully!
I found a small issue. Wildcard expressions can't be differentiated from a word where the wildcard character was escaped. Thus, id:\*something\* looks like id:*something* after parsing. However, they may need to be treated differently. The escaped version is just a normal string that includes a * but the unescaped version signifies a wildcard search.
good catch @aglassdarkly, so is it for ? I suppose.
@aglassdarkly maybe escape char removal that I did is a bad idea, and we should keep'em, I see that on the elasticsearch side, match does not have problem with them.
How is it on your side, is it better if we keep the '\' in the value ? (I would however method to remove them easily)
I think leaving the escape chars in and adding a method to remove them is the best approach. This would allow differentiation between a wildcard expression (unescaped wildcard char) and a normal word (escaped wildcard char), allowing me to generate appropriate SQL.
@aglassdarkly new commit to check 7e7a73e084f099b44226066dcdf33f5a3989aed7
unescaped_value
property for Termwildcards_iter
method on Term
(I thought it would be useful to you)I have verified that ElasticSearch (at least in 2.3) supports keeping escaped chars in match
and match_phrase
query. That means I don't have to modify the elasticsearch query generation part.
Thanks for the updates. I think this is a good approach.
I didn't find wildcards_iter() useful in my specific case. Here is what I ended up doing:
def sub(val):
if val == '*':
return '%'
elif val == '?':
return '_'
else:
return Term(val).unescaped_value
unescaped = ''.join([sub(p) for p in item.expr.WILDCARDS_PATTERN.split(item.expr.value)])
Adding a split_wildcards() method would avoid my using the regex directly and might be useful for others with similar use cases.
Hi @aglassdarkly, thanks, I will look into this next week.
Hi @aglassdarkly, is 61dd13dff6088a92380a2e63ed5105e206473a48 doing the job ?
If, so I let you close the ticket !
Looks good. Thanks working with me on this.
The parser doesn't handle escaped characters.
See: https://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Escaping%20Special%20Characters