jurismarches / luqum

A lucene query parser generating ElasticSearch queries and more !
Other
188 stars 40 forks source link

Setting OR operation to only work on adjacent values without needing brackets #42

Closed huntfx closed 5 years ago

huntfx commented 5 years ago

Short version is I would like to parse a b OR c d as a AND (b OR c) AND d.

Long version: I'm attempting to simplify the syntax a little before I integrate this into my code, as it's supposed to be an easy to use search with optional advanced features. I gave up coding it myself and found this library which seems nice.

I'd like it to work like Google where things are "AND" by default, but by providing "OR", it'll compare the two closest values. I've set UnknownOperationResolver to AND and tried changing the order of parser.precedence, but not had any luck.

Here's an example of a query I'd like to use:

Working syntax: (either page 1 or page 2, has either "a" or both "b" and "c", title is not sometitle)

(page:page1 OR page:page2) AND (a OR b AND c) AND -title:sometitle

Wanted syntax:

page:page1 OR page:page2 a OR (b c) -title:sometitle

For the record, we're still stuck on Python 2.7 for another year or two, so as I've already had to fix the yield from lines, I'm not against tweaking other bits of the code if needed.

alexgarel commented 5 years ago

Hello @Peter92.

I'm not sure of what you are trying to do. Should your request become an elasticsearch request, or do you just want to get the parsing tree ?

However changing operator precedence seems the right thing to do in your case.

To change OR to AND for implicit operator, you can use the default_operator parameter in ElasticsearchQueryBuilder.

If come back with some code, and the error or incorrect behaviour you get, I may be able to help.

huntfx commented 5 years ago

Ah thanks, I'm just using the tree as I'm combining this with SQLAlchemy, I figured it'd be easier to recursively iterate over the children and build up the query. I'm just wanting the tree to build itself a little differently, so that an OR operator will only apply to the two closest values.

Quick example again similar to above, but with the tree:

>>> parser.parse('a OR b c d')             # the same as ((a OR b) AND c AND d)
#output: OrOperation(Word('a'), AndOperation(Word('b'), Word('c'), Word('d')))
#wanted: AndOperation(OrOperation(Word('a'), Word('b')), Word('c'), Word('d'))

Am I able to apply your suggestion just on the tree, or is that only for the elastic search thing?

I found a minor bug btw, parser.parse("'word'") evaluates as Word("word'").

alexgarel commented 5 years ago

No, the implicit operation resolves to UnknownOperation.

In [6]: parser.parse('a OR b c d')                                                                  
Out[6]: OrOperation(Word('a'), UnknownOperation(Word('b'), Word('c'), Word('d')))

So you can either :

And thanks for the small bug report !

huntfx commented 5 years ago

It was a slightly incomplete example as I was calling UnknownOperationResolver(AndOperator)(tree), ignoring that though haha, the tree result I would like then is this:

UnknownOperation(OrOperation(Word('a'), Word('b')), Word('c'), Word('d')))

By default, a OR b c d is being parsed as a OR (b c d), but I would like it parsed as (a OR b) c d.

huntfx commented 5 years ago

Actually I think I just got it working. Moving the ('left', 'OR_OP',) line to the bottom of luqum.parser.precedence seems to have done what I wanted, from very light testing it doesn't seem to have broken anything.

I know it's not a common use case, but perhaps it'd be quite nice if there was some option to modify the order without editing the source code.

alexgarel commented 5 years ago

You can do this in your own file:

from luqum.parser import *                                                                  

precedence = list(precedence)
or_op = precedence.pop(0)
precedence.append(or_op)
precedence = tuple(precedence)                                                             
myparser = yacc.yacc()
tree = myparser.parse('a OR b c d')
print(tree, "is", repr(tree))

This outputs: a OR b c d is UnknownOperation(OrOperation(Word('a'), Word('b')), Word('c'), Word('d'))

NB: you can't do it in REPL because yacc.yacc() needs to load the file.

PS: indeed, putting or_op in second position does not work if you use implicit operator, but would work if operations where AND. this is because, it as to be considered after the "TERM" operation, as "b c d" only contains TERM.

huntfx commented 5 years ago

Got it working perfectly with SQLAlchemy now, thanks a bunch for coding this.

I just have a final quick question so I'll ask here instead of opening a new issue.

Would it be possible modify the parsing to allow empty search terms, so that "description: name:" for example will return something like UnknownOperation(SearchField('description', None), SearchField('name': None))?

I tried editing p_field_search but it doesn't get that far before throwing an error.

alexgarel commented 5 years ago

In current grammar, foo:bar:(test) is valid and interpreted as SearchField('spam', SearchField('foo', Word('bar'))) (this permits complex nested or object search).

So to support what you want, you would have to rewrite a good part of the grammar.

"TERM COLUMN unary_expression" rule should not stand any-more, as you want to put restrictions on the right part (it can't be a search). So now you have to differentiate unary_expression and a new search_expression.

Also because space are not considered, luqum can't differentiate "name: a" from "name:a". While I imagine you want the first to be "name:None OR a" and the second "name:a".

In Lucene, to make a match on every object with a non empty "name", it is written "name:*".

If you just want to tweak the product, I would suggest using a regexp like so:

marker = "*"  # use something you know want be used by user
empty_search = re.compile(r'(?<expr>{term_re}:)\s+'.format(term_re=TERM_RE))
expression = empty_search.sub("\g<expr>" + marker, expression)

Then when you meet marker you can use None.

huntfx commented 5 years ago

Ah thanks for the suggestions, I didn't realise it'd be quite a big tweak.

I couldn't get the gammar edits to work (I'm running a bit blind with ply and don't really understand what counts as valid syntax for it), however based your regex idea, I've got it working by just checking the string for any occurance of : that isn't contained in speech marks, and adding in a null marker.

Other than for nested search fields which I hadn't considered before, It's now fully hooked up to the database and works perfectly, so thanks again :)