vlasovskikh / funcparserlib

Recursive descent parsing library for Python based on functional combinators
https://funcparserlib.pirx.ru
MIT License
338 stars 38 forks source link

Backtracking #75

Open Kodiologist opened 2 years ago

Kodiologist commented 2 years ago

It looks like funcparserlib has no way to allow backtracking, in the manner of Parsec's try. #43 provides one example of where you might want this. I found myself wanting it in changing Hy to parse dfor forms like this (https://github.com/hylang/hy/issues/2324):

(dfor  x (range 10)  y (range 3)  #(x y) (* x y))

It would be the equivalent of this Python:

{(x, y): x*y  for x in range(10)  for y in range(3)}

dfor and friends in Hy have a relatively complicated parser, so here's a simplified example:

from collections import namedtuple
from funcparserlib.parser import tok, many, finished
from funcparserlib.lexer import Token

FORM = tok('FORM')

loopers = many(FORM + FORM)

lfor = loopers + FORM + finished
dfor = loopers + FORM + FORM + finished

def f(parser, ts):
    return parser.parse([Token(t, 1) for t in ts.split()])

print(f(lfor, 'FORM FORM  FORM FORM  FORM'))
print(f(dfor, 'FORM FORM  FORM FORM  FORM FORM'))

Here the example with lfor returns ([(1, 1), (1, 1)], 1, None), as desired, but for dfor, I get funcparserlib.parser.NoParseError: got unexpected end of input, expected: FORM. The problem is that loopers consumes all the FORM FORM pairs and doesn't try moving back one iteration of many to give the final FORM + FORM + finished in dfor a chance to match.

vlasovskikh commented 2 years ago

See also the docs on try in Parsec.

vlasovskikh commented 2 years ago

Note: This Markdown reply is executable via python3 -m doctest reply-text.md.

Imports:

>>> from funcparserlib.parser import a, many, finished, forward_decl, Parser, pure

The alternative <|> combinator in Parsec always uses the look ahead of 1. Therefore you have to use try there if you need an arbitrary look ahead. In contrast to that, the alternative | combinator in funcparserlib always uses an arbitrary look ahead. Here is an example with the look ahead of 2, but it's really arbitrary:

>>> x = a("x")
>>> y = a("y")
>>> z = a("z")
>>> xxy_or_xxz = x + x + y | x + x + z
>>> xxy_or_xxz.parse("xxy")
('x', 'x', 'y')
>>> xxy_or_xxz.parse("xxz")
('x', 'x', 'z')

@Kodiologist Could you please share how exacly you would have fixed the problem in your example by using Parsec's try? Then you and I could think of either a workaround or a feature request for funcparserlib.

many() is greedy, like * in regexps. It consumes all matching tokens and always succeeds:

>>> x = a("x")
>>> many_x = many(x)
>>> many_x.parse("xxx")
['x', 'x', 'x']
>>> many_x.parse("y")
[]

The problem with the greedy * in x*xx implemented via many() is that it consumes all the tokens and the + combinator doesn't do any backtracking:

>>> x = a("x")
>>> many_x_then_x = many(x) + x + x
>>> many_x_then_x.parse("xxx")
Traceback (most recent call last):
    ...
funcparserlib.parser.NoParseError: got unexpected end of input, expected: 'x'

I guess that you, in fact, want a non-greedy version of many(). A sort of the *? operator in regexps in addition to *, right? A regexp equivalent of this would be x*?xx. Then you could have written something like this:

>>> x = a("x")
>>> def non_greedy_many(p):
...     # Fake implementation for a single case
...     return a("x") >> list
>>> many_x_then_x = non_greedy_many(x) + x + x
>>> many_x_then_x.parse("xxx")
(['x'], 'x', 'x')

The problem is that I currenlty have no idea how to implement it easily, since + in funcparserlib doesn't support any backtracking.

Workaround 1. Extract the last x, x at the tree transformation level:

>>> x = a("x")
>>> def to_results(args):
...     x1, x2, xs = args
...     combined = [x1] + [x2] + xs
...     *rest, last1, last2 = combined
...     return rest, last1, last2
>>> many_x_then_x = x + x + many(x) >> to_results
>>> many_x_then_x.parse("xxx")
(['x'], 'x', 'x')
>>> many_x_then_x.parse("xx")
([], 'x', 'x')
>>> many_x_then_x.parse("x")
Traceback (most recent call last):
    ...
funcparserlib.parser.NoParseError: got unexpected end of input, expected: 'x'
>>> many_x_then_x.parse("")
Traceback (most recent call last):
    ...
funcparserlib.parser.NoParseError: got unexpected end of input, expected: 'x'
Kodiologist commented 2 years ago

It's been a long time (over a decade) since I touched Parsec and so perhaps I was wrong in thinking it can be done with try. But yes, a non-greedy repetition operator fits the bill here.