python-parsy / parsy

Easy and elegant parser combinators for Python. With awesome docs.
MIT License
366 stars 37 forks source link

RFC: Automagical left recursion elimination #5

Open bugaevc opened 7 years ago

bugaevc commented 7 years ago

Sometimes, it would be very convenient to implement a parser using left recursion, like in this example:

number = digit.at_least(1).map(''.join).map(int)

@generate
def expr():
    lhs = yield expr
    op = yield string('+') | string('-')
    rhs = yield number
     if op == '+':
        return lhs + rhs
    else:
        return lhs - rhs

expr |= number

Note that this cannot be easily rewritten to use number +/- expr, because we need to parse 1 - 2 - 3 as (1 - 2) - 3 and not 1 - (2 - 3).

This currently doesn't work because calling expr inside of expr causes infinite recursion. The proper way to implement this is to unwrap the recursion manually, i.e. to use an explicit loop that consumes and adds/subtracts a sequence of numbers. However, this is much less convenient than implementing 'the mental model' directly.

This RFC proposes for parsy to be able to do such a transformation automatically.

A proof-of-concept (barely readable) implementation is here, and here's an explanation in pseudocode (that skips over many small details):

recursion_detected = False
parser_is_running = False
saved_result = None

def invoke_parser():
    global parser_is_running, recursion_detected, saved_result

    if not parser_is_running:
        # we are the first invocation of this parser at this index
        parser_is_running = True
        res = normal_logic()
    else:
        # we are the second left-recursive invocation
        recursion_detected = True
        return saved_result or fail('recursion detected')

    # we are the first invocation again
    if not recursion_detected or res.is_fail():
        parser_is_running = False  # and other cleanup
        return res

    while True:
        saved_result = res
        res = normal_logic()
        if res.is_fail():
            parser_is_running = False  # and other cleanup
            return saved_result

This works as if the recursion magically failed at just the right depth, giving the alternative sub-parsers (number in the example above) a chance to try their logic.

Unresolved questions:

spookylukey commented 7 years ago

I currently don't understand either the problem or your solution well enough to be able to give feedback on this. The questions around the implementation, plus the use of mutability etc. make me think this is not a great idea, at least for a default. Also, it would be good to know what other similar projects do for this case.

spookylukey commented 7 years ago

Thinking a little bit more:

I suspect the underlying issue here is that you are doing parsing and evaluating in one step. I think for these situations (i.e. where you have some precedence rules that you need to apply), then you should be using parsy to lex the string into tokens, and then run a separate parser and evaluator that consumes the tokens. Something like this might help. http://effbot.org/zone/simple-top-down-parsing.htm

It feels like parsy is the wrong place to be building the logic to cope with these kind of issues.

bugaevc commented 7 years ago

The questions around the implementation, plus the use of mutability etc. make me think this is not a great idea, at least for a default.

Agreed. A proper implementation that I hope to write soon should be way more understandable, and I have an idea about how to mitigate the problem of mutability.

I currently don't understand either the problem or your solution well enough to be able to give feedback on this. <...> Also, it would be good to know what other similar projects do for this case.

The problem is that it isn't convenient to parse left-recursive grammars using parsy. The most common case of left recursion (and the one shown in the example above) is parsing left-associative operator chains. Parsec implements a different, perhaps less magical, solution in the form of:

chainl1 :: Stream s m t => ParsecT s u m a -> ParsecT s u m (a -> a -> a) -> ParsecT s u m a

Now, I have a hard time parsing that, but IIUC it works like this, in parsy-like syntax:

sign = string('+').result(operator.add) | string('-').result(operator.sub)
expr = num.chained_by(sign)

This may be a bit more difficult to comprehend (higher-order functions and stuff) and it only works for that simple case; still I think it's a brilliant solution and we should implement that.

My "magical" solution is more generic and easier to use, but harder to reason about. Perhaps we can have both, the magical one being an explicit wrapper for "advanced" use-cases.

I suspect the underlying issue here is that you are doing parsing and evaluating in one step. I think for these situations (i.e. where you have some precedence rules that you need to apply), then you should be using parsy to lex the string into tokens, and then run a separate parser and evaluator that consumes the tokens. Something like this might help. http://effbot.org/zone/simple-top-down-parsing.htm

I don't think that's the problem. Right now, parsy is a great framework for creating lexers; it was possible to create parsers with it before, too. In fact, I was thinking we'd build a more robust and high-level infrastructure for these things, similar to

infix("+", 10); infix("-", 10)
infix("*", 20); infix("/", 20)
infix_r("**", 30)
prefix("+", 100); prefix("-", 100)

symbol("(literal)").nud = lambda self: self
symbol("(end)")

code from the article you've linked, and left recursion elimination (in one form or another) is the first step to that. Ideally, I'd imagine something like:


import parsy.lexems as pl
from parsy.op import infix, priority_group

expr = parsy.Placeholder()

ops = parsy.operators(
    priority_group(
        infix(pl.plus),
        infix(pl.minus)
    ),
    priority_group(
        infix(pl.asterisk),
        infix(pl.slash)
    ),
)

expr2 = (
    pl.num |
    pl.lparen >> expr << pl.rparen |
    ops
)
expr.set(expr2)
spookylukey commented 7 years ago

@bugaevc Thanks for your explanation. I've done a bit more Googling and understand the issue with left-recursive grammars better now. I found the same solution as you did.

I also found, as I think you were implying, that there are some situations it doesn't work for e.g. for postfix left recursion, also with some solutions - https://www.reddit.com/r/haskelltil/comments/3ocukk/a_function_postfixchain_to_parse_a_left_recursive/ and http://www.joelwilliamson.ca/programming/parsing/haskell/2015/02/04/Parsec-Left-Recursion.html

I think the best approach at the moment is to implement something like chainl1 (with a better name!), possibly something for the postfix case, and have some cookbook style docs on left recursive grammars. If you wanted to contribute a version of your automagical implementation, we could add it as an explicit wrapper as you suggest, and mark it as experimental (I've haven't yet had the chance to get into the implementation, and I don't want to commit to maintaining something I don't understand!)

A further step would be adding the higher level constructs you talk about, which I'm definitely open to, if they are generic enough.

douglas-raillard-arm commented 2 years ago

Dropping it here in case someone encounters the same issue:

# Python translation of parsec's chainl1:
# https://hackage.haskell.org/package/parsec-3.1.15.1/docs/Text-Parsec-Combinator.html#v:chainl1
def parse_left_associative_expr(op, operand):
    def rest(x):
        @generate
        def f():
            f = yield op
            y = yield operand
            return rest(f(x, y))

        return f | success(x)

    @generate
    def f():
        x = yield operand
        return rest(x)
    return f

EDIT: removed custom const() and replaced by parsy's success()