Fxhnd opened 7 years ago
The conflict was with the `nonlogicals` token. I added a precedence rule and made it left-associative, and that seemed to resolve the conflict. However, I haven't tested it extensively.
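For reference, the fix looks something like this; a minimal sketch, assuming the underlying token is named `NONLOGICAL` (the actual token name in the grammar may differ):

```python
# Declaring the token left-associative gives yacc an explicit rule for
# resolving the conflict instead of relying on its default shift behavior
precedence = (
    ('left', 'NONLOGICAL'),
)
```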
Nice!
I have a couple of questions about the addition of the `*_error` states to the grammar, though. Is it a good idea for the parser to attempt to parse things that don't match the intended grammar? I see that the error states raise `TypeError` exceptions, but I think there's a possibility that those error states would be confusing later on. For example, it looks like `p_connective_error()` references back to `axiom_list`, which, while I get the intent, may let the parser continue onwards into the sentence and consume more tokens. By the time either a `p_connective_error` is fully matched or `p_error` is hit, the error that gets raised won't have anything to do with the actual problem anymore. For this reason I think it's better to lean on just using the built-in `p_error` and report the current token plus line number. If we keep the grammar so it only proceeds according to our desired syntax, it'll fail earlier and less ambiguously.
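Something along these lines; a minimal sketch of the built-in hook (the `None`-at-EOF behavior and token attributes are PLY's, the message format is an assumption):

```python
def p_error(p):
    # PLY calls this with the offending token, or None at end of input
    if p is None:
        raise SyntaxError("Unexpected end of input")
    raise SyntaxError("Unexpected token '{}' at line {}".format(p.value, p.lineno))
```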
I agree that the rules definitely add ambiguity to the grammar, but for my project I need to be able to identify some common errors. From `p_error` there's no way to tell what went wrong without doing expensive string comparisons, so I believe this method has lower overhead. I could make a second parser, something like `Parser_Unstable.py`, if you think that's a good idea.
I haven't done extensive testing on the grammars I've written, but I have tested them and they seem to work fine. Additionally, I think if the grammar can't identify the error correctly it should just default to `p_error`, which will give the default syntax error output.
It might be worthwhile creating two versions of the grammar: one "parser" (like the one Rob had) that is just interested in parsing a correct file and throws a possibly cryptic error if something is wrong, and a second "debugger" that may continue parsing after encountering an error as an attempt to figure out what is wrong. Ideally, the second one builds on the first (maybe via a "debug" mode?) and doesn't require two complete sets of grammars.
I've been messing around and researching this issue, and here's what I've found: PLY would have to regenerate `parsetab.py` every time we change modes, which will slow down performance. I definitely want to avoid having two versions of the parser, since that would quickly become a pain to maintain if one version of the parser gets ahead of the other. The maintainer of PLY also states that the library doesn't play nicely with trying to support two different parsers within the same namespace, so there is that too.
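For what it's worth, `yacc.yacc()` does accept a `tabmodule` argument, so each mode could at least cache its own tables instead of regenerating `parsetab.py` on every switch; a sketch, with a made-up module name:

```python
import ply.yacc as yacc

# yacc.yacc() caches its tables; giving each mode its own tabmodule
# avoids rebuilding parsetab.py on every switch, though PLY still
# assumes one grammar per namespace (the module name is hypothetical)
parser = yacc.yacc(tabmodule='parsetab_debug')
```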
I might be misunderstanding what the extra `*_error()` states in the BNF give us. Looking at the code I can see that they raise `TypeError` if certain, expected, malformed input conditions are met. However, since the `*_error()` states are states just like any other, it's possible for them to fail to parse correctly as well. In that case they all fall back to the aforementioned `p_error()` function. What is the expected output of the parser on malformed input?
Input like

```
(forall (x)
((and
(not (ZEX x))
(Cont x x)
))
)
```
with the proposed scheme will yield errors like `TypeError: There may be a missing ")" parenthesis`, or, if you remove the nested `p_error(p)` function call, it'll yield `TypeError: Missing connective`.
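For concreteness, the kind of rule in question presumably looks roughly like this (a hypothetical reconstruction, not the actual code from the branch):

```python
def p_connective_error(p):
    """
    connective_error : LPAREN axiom_list RPAREN
    """
    # Matches a parenthesized axiom list with no leading connective,
    # then reports the suspected cause
    raise TypeError("Missing connective")
```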
Instead of baking in non-exhaustive error states, I think we'd be better off falling into the `p_error(p)` function and using the current lexing `TOKEN` plus the parsing stack from `yacc.yacc` to provide useful feedback. E.g. the same snippet from above would produce:
```
Error in input at line 9!
Unexpected Token: (
Last Token: (
Parser Stack:
LPAREN START URI LPAREN FORALL LPAREN nonlogicals RPAREN LPAREN

Traceback (most recent call last):
  File "../../bin/parser.py", line 30, in <module>
    ontology = Parser.parse_file(args.file, args.sub, args.base, args.resolve)
  File "/users/rpskillet/Vault/thesis/macleod/macleod/parsing/Parser.py", line 329, in parse_file
    parsed_objects = yacc.parse(buff)
  File "/usr/local/lib/python3.6/site-packages/ply/yacc.py", line 333, in parse
    return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
  File "/usr/local/lib/python3.6/site-packages/ply/yacc.py", line 1201, in parseopt_notrack
    tok = call_errorfunc(self.errorfunc, errtoken, self)
  File "/usr/local/lib/python3.6/site-packages/ply/yacc.py", line 192, in call_errorfunc
    r = errorfunc(token)
  File "/users/rpskillet/Vault/thesis/macleod/macleod/parsing/Parser.py", line 308, in p_error
    raise SyntaxError("Unexpected token found in input!")
SyntaxError: Unexpected token found in input!
```
If we really want to decorate further, then we could add grammar rules that expect the `error` token. This is different from adding additional states to parse malformed input. For example, instead of including the new rule `p_connective_error` and adding it as a reduction for `p_axiom`, we'd just add mirrors of our existing rules (e.g. `p_conjunction_error`). It looks like this was done with `p_comment_error` already. Building off the example from before:
```python
def p_universal_error(p):
    """
    universal : LPAREN FORALL LPAREN nonlogicals RPAREN error axiom RPAREN
              | LPAREN FORALL LPAREN nonlogicals RPAREN error RPAREN
              | LPAREN FORALL LPAREN nonlogicals RPAREN axiom error RPAREN
    """
    raise ParseError("Error parsing term in Universal")
```
which would yield the error output of:

```
Error in input at line 9!
Unexpected Token: (
Last Token: (
Parser Stack:
LPAREN START URI LPAREN FORALL LPAREN nonlogicals RPAREN LPAREN

Traceback (most recent call last):
  File "../../bin/parser.py", line 30, in <module>
    ontology = Parser.parse_file(args.file, args.sub, args.base, args.resolve)
  File "/users/rpskillet/Vault/thesis/macleod/macleod/parsing/Parser.py", line 345, in parse_file
    parsed_objects = yacc.parse(buff)
  File "/usr/local/lib/python3.6/site-packages/ply/yacc.py", line 333, in parse
    return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
  File "/usr/local/lib/python3.6/site-packages/ply/yacc.py", line 1120, in parseopt_notrack
    p.callable(pslice)
  File "/users/rpskillet/Vault/thesis/macleod/macleod/parsing/Parser.py", line 238, in p_universal_error
    raise ParseError("Error parsing term in Universal")
macleod.parsing.Parser.ParseError: Error parsing term in Universal
```
The nice thing about the `error` token is that it will continue to consume tokens from the lexer until it's able to resynchronize the parser and settle on which term we fudged. Normally this is used for error recovery, and it's the recommended procedure that doesn't involve black magic and other witchcraft. If mirrors are written for each `p_*` rule that we have, then we both tighten the definition of what we can parse and can give very verbose feedback on what is wrong with a parsed string. Note that the example rule I gave above is very minimal; in reality you would look at each of the tokens in `p` to figure out which error state you're in and print accordingly.
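For instance, a minimal sketch of what that inspection might look like; `p.slice` and `p.lineno()` are standard PLY accessors, but the exact checks and messages here are assumptions:

```python
def p_universal_error(p):
    """
    universal : LPAREN FORALL LPAREN nonlogicals RPAREN error axiom RPAREN
              | LPAREN FORALL LPAREN nonlogicals RPAREN error RPAREN
              | LPAREN FORALL LPAREN nonlogicals RPAREN axiom error RPAREN
    """
    # p.slice holds the raw grammar symbols for the production that
    # matched, so we can tell which alternative fired and tailor the
    # message (ParseError is assumed to exist in the surrounding module)
    line = p.lineno(2)  # line of the FORALL token
    if len(p) == 8:
        raise ParseError("Universal at line {} has no body".format(line))
    elif p.slice[6].type == 'error':
        raise ParseError("Universal at line {} has a malformed term before its body".format(line))
    else:
        raise ParseError("Universal at line {} has a malformed term after its body".format(line))
```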
Of course, the parser could also stand to have its states renamed and some of the recursive rules revised so they are clearer. Hopefully that helps; let me know what you think!
Alrighty, I see what you're saying about malformed input. I suppose I didn't realize what kinds of things we can do with PLY. I will remove the malformed-input rules and try to just use the `error` token.
@Fxhnd is it best to have the grammars consider the parentheses at the highest level or the lowest level? It seems like we try to take them into account as close to the terminals as possible, but I'm not sure what the reasoning is for one way or another. Should we try to stick to one convention?
I'm not sure if there is a solid design/logic benefit one way or the other. I tried to keep the parentheses inside the sub-rules just to keep the rules less cluttered at the higher levels:

```
axiom : conjunction | disjunction | etc..
```

looks better to me than

```
axiom : lparen conjunction rparen | ...
```

plus having the parens within the sub-rules lets us do things like print the whole "construct", including the parens, for debugging.
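Concretely, that convention means each sub-rule owns its own parens; a sketch of what a sub-rule might look like under it (the exact token names and return value are assumptions):

```python
def p_conjunction(p):
    """
    conjunction : LPAREN AND axiom_list RPAREN
    """
    # Because the parens live inside the sub-rule, p spans the whole
    # parenthesized construct, which makes debug printing easy
    p[0] = ['and'] + p[3]
```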
Also, I whipped up a quick and useful error output for the `p_error()` function that we could probably build off of. I think it'll probably do something funny if the error token ends up being parsed into a nonlogical, so that's something that needs looking at. It prints out the construct where the unexpected token was found and also underlines the error within the output.
```
Error at line 18! Unexpected Token: '(̲' :: "( forall ( x ) ( (̲ and ( not ( ZEX x ) ) ( Cont x x ) ) ) ) "
Error at line 15! Unexpected Token: 'l̲o̲l̲' :: "( forall l̲o̲l̲ ( x ) ( and ( not ( ZEX x ) ) ( Cont x x ) ) ) "
```
Let me know if you have any other questions!
```python
def p_error(p):
    global parser

    # A little stack manipulation here to get everything we need
    stack = [symbol for symbol in parser.symstack][1:]

    # Find the most recent fully-parsed 'axiom' on the stack; everything
    # after it belongs to the construct that failed to parse
    last_axiom = next((x for x in stack[::-1] if x.type == 'axiom'), None)
    pivot = (len(stack) - stack[::-1].index(last_axiom)) if last_axiom is not None else 0
    current_axiom = stack[pivot:]
    current_axiom.append(p)

    # Use the brace level to figure out how many future tokens we need
    # to complete the errored construct
    lparens = len([x for x in current_axiom if x.type == "LPAREN"])
    lookahead_tokens = []
    while lparens != 0:
        lookahead_token = parser.token()
        if lookahead_token is None:
            break
        lookahead_tokens.append(lookahead_token)
        if lookahead_token.type == "RPAREN":
            lparens -= 1
        elif lookahead_token.type == "LPAREN":
            lparens += 1

    # Put together a full list of tokens for the errored construct
    current_axiom += lookahead_tokens

    # String manipulation to underline the error token (U+0332 is the
    # combining low line character)
    p.value = ''.join([char + '\u0332' for char in p.value])

    axiom_string = []
    for token in current_axiom:
        raw_token = token.value
        if isinstance(raw_token, str):
            axiom_string.append(raw_token + ' ')
        elif isinstance(raw_token, list):
            for sub_token in raw_token:
                axiom_string.append(sub_token + ' ')

    print("""Error at line {}! Unexpected Token: '{}' :: "{}"\n\n{}""".format(
        p.lexer.lineno, p.value, ''.join(axiom_string),
        ' '.join([symbol.type for symbol in stack])))

    return p
```
When `Parser.py` generates the parsetab file it throws a warning about a shift/reduce conflict. Essentially this means that the grammar, as defined, has one situation where the parser wouldn't know whether it should shift another token or reduce an already-matched rule. I don't think it's a problem, because the default behavior is that the parser will shift, which for this grammar should work every time.
It'd still be good to clean up the grammar and document it, though.
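If we want to pin down exactly where the conflict comes from, PLY can dump its full state machine; a minimal sketch:

```python
import ply.yacc as yacc

# Building with debug=True makes PLY write a parser.out file next to the
# grammar, which lists every parser state and flags each shift/reduce
# conflict along with the productions involved
parser = yacc.yacc(debug=True)
```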