jurismarches / luqum

A lucene query parser generating ElasticSearch queries and more !

SearchField.name with spaces #49

Closed huntfx closed 4 years ago

huntfx commented 4 years ago

Luqum is mostly working perfectly for me, but I've just hit a bit of a snag. I have a double SearchField to perform a more advanced query, but I've just realised it won't work with spaces.

>>> field:value1:"value 2"
SearchField('field', SearchField('value1', Phrase('"value 2"')))

>>> field:"value 1":"value 2"
luqum.parser.ParseError: Syntax error in input at LexToken(COLUMN,':',1,24)!

Is this a limitation of yacc or is there a way I could get this working?

In the meantime, I've used this to convert "value 1" to value 1, it's awfully messy though.

# This will convert 'field:"value 1":"value 2"' to 'field:value 1:"value 2"'
# It will need to be decoded again before being used
offset = 0
while True:
    try:
        index = value[offset:].index(':"') + 1
    except ValueError:
        break
    offset += index
    end = False
    for i, c in enumerate(value[offset:]):
        if not end:
            if i and c == '"':
                end = True
        elif c == ' ':
            break
        elif c == ':':
            word = value[offset:offset+i]
            new_word = word[1:-1].replace(' ', ' ')
            value = value[:offset] + new_word + value[offset+i:]
            offset += len(new_word) - len(word)
            break
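
The same workaround can be sketched more compactly with a regular expression. This is only an illustration, not part of luqum: `SPACE_TOKEN`, `encode_field_names` and `decode_field_name` are hypothetical names, the placeholder string is an arbitrary assumption, and escaped quotes inside a name are not handled.

```python
import re

# Hypothetical placeholder (an assumption): choose any string that
# cannot occur in your real field names.
SPACE_TOKEN = "__SPACE__"

# A double-quoted string immediately followed by a colon, i.e. a quoted
# field name. Escaped quotes are deliberately not handled in this sketch.
QUOTED_NAME = re.compile(r'"([^"]*)"(?=:)')


def encode_field_names(query):
    """Drop the quotes around field names and replace their spaces."""
    return QUOTED_NAME.sub(
        lambda m: m.group(1).replace(" ", SPACE_TOKEN), query
    )


def decode_field_name(name):
    """Restore the spaces in a field name read back from the parsed tree."""
    return name.replace(SPACE_TOKEN, " ")


encoded = encode_field_names('field:"value 1":"value 2"')
print(encoded)
print(decode_field_name("value__SPACE__1"))
```

The encoded string contains no quoted field names, so it can be handed to the unmodified parser; each field name in the resulting tree then needs to go through `decode_field_name` before it is used.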

alexgarel commented 4 years ago

Hi @Peter92.

Not having spaces in search field names is not a limitation of PLY, but a limitation of the parser as it is defined right now.

I think it's better for luqum to stick to the Lucene Query Language definition, which does not permit spaces in field names (afaik). Or is that permitted by Elasticsearch? In which case we may change luqum's behaviour.

However, what you can do is write your own parser.py, which imports everything from luqum's parser.py but changes some definitions. If I guess correctly, you'll just have to change the p_field_search docstring from:

    '''unary_expression : TERM COLUMN unary_expression'''

to:

    '''unary_expression : term_or_phrase COLUMN unary_expression'''

(and maybe remove the quotes if they are present in p[1] according to your preferences).

And at the end of your file, create the parser again.

huntfx commented 4 years ago

Thanks for the help, the syntax is definitely the weirdest I've seen in Python.

I used phrase_or_term, which I assume you meant. It does enable the use of speech marks, but it comes up with a syntax error if anything follows the search term: "a:b OR c" gives Syntax error in input at LexToken(OR_OP,'OR',1,4).

huntfx commented 4 years ago

Actually it doesn't matter too much. Since it's only the tree generation I'm actually using, I've been messing around building a simpler parser using luqum as inspiration. It's a lot slower, but at least I can add custom things without asking you how to do it haha.

Cheers again for the help, and I'll close since the question is not technically to do with luqum.

alexgarel commented 4 years ago

Ok, best wishes for your projects :-)

alexgarel commented 4 years ago

Out of curiosity I tried, and indeed PLY is tricky. Doing as I told you, importing from the parser in a new file, does not work because the previous rule is not cancelled :-/

If I replace the original expression directly in the parser.py file:

def p_field_search(p):
    '''unary_expression : phrase_or_term COLUMN unary_expression'''
    if isinstance(p[3], Group):
        p[3] = group_to_fieldgroup(p[3])
    p[0] = SearchField(p[1].value.strip('"'), p[3])

It works:

>>> from luqum.parser import parser
Generating LALR tables
WARNING: 11 shift/reduce conflicts
>>> parser.parse('c:a OR b AND "x y":"z r"')
OrOperation(SearchField('c', Word('a')), AndOperation(Word('b'), SearchField('x y', Phrase('"z r"'))))