javawizard / parcon

A Python parser combinator and formatter combinator library that's fast, easy to use, and provides informative error messages
http://parcon.opengroove.org
36 stars 19 forks source link

Unexpected backtracking behavior or bad error message #5

Open dvogel opened 12 years ago

dvogel commented 12 years ago
from parcon import SignificantLiteral, Regex

Digits = Regex(ur"[0-9]+")
DecimalLiteral = ((Digits + SignificantLiteral(".") + Digits)
                  | (SignificantLiteral(".") + Digits))
SinglePeriod = SignificantLiteral(".")
BadExpr = (SignificantLiteral("junk that won't match") | SinglePeriod)
WrappingExpr = (BadExpr | DecimalLiteral | SignificantLiteral(".."))

"""
Here's my understanding of what should happen:

Given the parser definitions above, when parsing "..", WrappingExpr
should attempt to parse using BadExpr. The first parser tried by
BadExpr will obviously fail, as designed. The second parser,
SinglePeriod will succeed. BadExpr will then succeed if all=False,
but with all=True it will fail. WrappingExpr will then attempt to
parse using DecimalLiteral. The first form of DecimalLiteral requires
a digit prefix, which fails to match. The second form of DecimalLiteral
matches the period prefix but then fails because it is not followed by
digits. WrappingExpr should then attempt the SignificantLiteral("..")
parser and that parser should succeed but instead I get:

    ParseException: Parse failure: At position 0: expected "junk that won't match"
"""
print "all=False"
print WrappingExpr.parse_string("..", all=False)
print "all=True"
print WrappingExpr.parse_string("..", all=True)
BrenBarn commented 9 years ago

The problem appears to be due to the way results and expectations are combined with filter_expectations. Specifically, it excludes "unsatisfiable" matches without considering whether they were successes or failures, and excludes them even if they occurred at a later match point than earlier "satisfiable" expectations.

The behavior can be reproduced more simply:

>>> g =  (SignificantLiteral("junko") | SignificantLiteral("."))
>>> g.parse_string('.x', all=False)
'.'
>>> g.parse_string('.x', all=True)
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    g.parse_string('.x', all=True)
  File "C:\Users\BrenBarn\Documents\Python\extensions\parcon\parcon\__init__.py", line 641, in parse_string
    raise ParseException("Parse failure: " + format_failure(result.expected), result.expected)
ParseException: Parse failure: At position 0: expected "junko"

It says it expects "junko" at the beginning, but that's wrong: it can accept either "junko" or "." at that position. I think the right behavior in this situation is to say "expected end of input at position 1" (i.e., after the period). The period did match, so our "best match" has already consumed the period, and our best guess about what should come next should start from after the period.

I can think of two ways to handle this. One is to modify filter_expectations so that it doesn't exclude EUnsatisfiable expectations at the maximum position. I'd have to look at the expectation handling more, but at first glance I don't understand why it's excluded; end-of-input is a perfectly reasonable thing to expect. The other solution is to make end-of-input its own kind of expectation, separate from EUnsatisfiable. It does seem a bit strange that EUnsatisfiable is used both for "we expect no more input" and for "logically impossible" cases like Invalid(). Invalid() cannot match even at the end of input, so it's different from expecting end of input.