Broken ANTLR4 grammar - Githubissues

opencypher / openCypher

Specification of the Cypher property graph query language

http://www.opencypher.org

Apache License 2.0

853 stars 150 forks source link

Broken ANTLR4 grammar #56

Closed dkunitsk closed 8 years ago

dkunitsk commented 8 years ago

Hi openCypher! I'm having some trouble using the provided ANTLR4 grammar to parse DDL statements.

First, in order to get ANTLR4 to generate target java code, I had to make a small change and rename the "return" rule. Otherwise, I would get the following error: error(134): Cypher.g4:71:9: symbol return conflicts with generated code in target language or runtime I am sure I have done this small change carefully though, without touching any other rules like returnBody.

Now, when I try to parse a simple DDL statement like:

CREATE (n:Person { name : 'Andres', title : 'Developer' }) \n

and print the parse tree using ANTLR's grun tool. I get the following errors:

line 1:17 extraneous input '{' expecting {')', WHITESPACE}
line 1:19 extraneous input 'name' expecting {')', WHITESPACE}
line 1:24 extraneous input ':' expecting {')', WHITESPACE}
line 1:26 extraneous input ''Andres'' expecting {')', WHITESPACE}
line 1:35 extraneous input ' ' expecting {'(', CYPHER, EXPLAIN, PROFILE, USING, PERIODIC, COMMIT, UNION, ALL, CREATE, DROP, INDEX, ON, CONSTRAINT, ASSERT, IS, UNIQUE, EXISTS, LOAD, CSV, WITH, HEADERS, FROM, AS, FIELDTERMINATOR, OPTIONAL, MATCH, UNWIND, MERGE, SET, DELETE, DETACH, REMOVE, FOREACH, IN, DISTINCT, RETURN, ORDER, BY, L_SKIP, LIMIT, DESCENDING, DESC, ASCENDING, ASC, JOIN, SCAN, START, NODE, RELATIONSHIP, REL, WHERE, SHORTESTPATH, ALLSHORTESTPATHS, OR, XOR, AND, NOT, STARTS, ENDS, CONTAINS, NULL, TRUE, FALSE, COUNT, FILTER, EXTRACT, ANY, NONE, SINGLE, REDUCE, CASE, ELSE, END, WHEN, THEN, L_0X, UnescapedSymbolicName, EscapedSymbolicName}
line 1:42 extraneous input ':' expecting {'=', WHITESPACE}
line 1:44 extraneous input ''Developer'' expecting {'=', WHITESPACE}
line 1:56 extraneous input '}' expecting {'=', WHITESPACE}
line 2:0 mismatched input '<EOF>' expecting '='

Additionally, while queries like this work:

MATCH (node1:Label1)

WHERE node1.propertyA = {value}

RETURN node2.propertyA, node2.propertyB

queries like this:

MATCH (tom:Person)-[:ACTED_IN]->(tomHanksMovies) RETURN tom

give the following error: line 1:18 mismatched input '-' expecting {<EOF>, ',', USING, UNION, CREATE, LOAD, WITH, OPTIONAL, MATCH, UNWIND, MERGE, SET, DELETE, DETACH, REMOVE, FOREACH, RETURN, START, WHERE, WHITESPACE}

Do you think there could be a bug in the provided grammar, or am I doing something wrong? Thank you!

Mats-SX commented 8 years ago

Hello @dkunitsk and thanks a lot for your feedback!

You're doing it just right; the problems you see are due to whitespace bugs in the grammar specification. We are aware that we have a number of these left to smoke out, and your testing is appreciated!

I've fixed the two queries that you've reported (with the exception of the stray \n that I assume is a typo from you in the first query), and will push a commit shortly.

Please keep these kind of findings coming; I'm sure there are more bugs in the grammar still.

FylmTM commented 8 years ago

Hi @Mats-SX,

Minor comment on whitespace in Cypher ANTLR4 grammar. I don't have (unfortunately) fixed version that I managed to get, but:

Whitespace that might be, or might not be between tokens should be send to HIDDEN channel. Example:

WS : [ \t\n\r]+ -> channel(HIDDEN) ;

Whitespace (SP) that must be between token should be placed straight into rules.

I don't know what magic ANTLR is doing with HIDDEN channel, but it solves issues.

When you are placing WS straight into grammar sometimes it tries to find space as token delimiter, where space is not necessary.

Hope it helps!

Mats-SX commented 8 years ago

@FylmTM We are trying to have our grammar specification be as general as possible, and not gearing it too much towards a certain parser generation tool, such as Antlr. The Antlr grammar is more like a proof of concept for us.

I'm not familiar with this hidden channel, but what kind of problems are you seeing exactly? Are there queries that aren't parseable due to us using optional whitespace directly in the rules?

FylmTM commented 8 years ago

Hi @Mats-SX ,

I managed to recover ANTLR4 grammar that I fixed for my own research.

As you can see I removed all WS from rules. Having WS straight into rules cause parser sometimes to fail. Some syntetic example:

Fail: (n) Works: ( n)

In fact I don't have any ANTLR experience. For me hiding WS into HIDDEN channel worked.

Gist: https://gist.github.com/FylmTM/4b34d4d6ad8394c524c9b98bf2f35bb7

Mats-SX commented 8 years ago

@FylmTM That's quite the simplification of the grammar! For many purposes, that seems very likely to be the better approach to take. Nice find, and thanks for sharing!

On the other hand, it seems to me that this may mask that the grammar source has an incorrect whitespace rule setup. Our main use of the Antlr grammar is for proof-checking the source, and we want it to fail when there's a whitespace bug (as opposed to magically fixing it with an Antlr-specific feature). So for our purposes, I think we'll not make the generator remove WS rules for the hidden channel, but I can definitely see how that is useful for a consumer of the Antlr grammar, such as @dkunitsk.

FylmTM commented 8 years ago

Hi @Mats-SX,

Can you share your Cypher queries that you used to test grammar? This are mine queries.

If you take existing ANTLR grammar (from README) and try to parse timetree - you will see some whitespace-related false error marks.

Example: 1: WITH range(2011, 2014) - error here. If we removed space between 2011,2014 then everything will be parsed correctly and next error will be shown. The issue with this place is that command in function arguments collides with a comma between return items.

2: FOREACH(year IN years | - we should remove space before |. 4: FOREACH(month IN months | - and here.

3: MERGE (y:Year {year: year}) - error is now here. If we remove space before { it starts to work correctly. 5: CREATE (m:Month {month: month}) - same.

6: MERGE (y)-[:HAS_MONTH]->(m) - if we place space before - then it removes error mark.

Then it goes crazy on first CASE rule.

From what I have seen so far (during grammar plugin creation and further experimentation with ANTLR) - Parboiled parser works differently compared to any other popular tool. Reason - Parboiled is lexer-less and constructs parse tree during actual parsing. However, tools like ANTLR are splitting input into tokens at the first step and then constructing parse tree from tokens.

And if you have ws rule in your grammar and it is placed in inside other rules, then you will have ws token. And in tree construction phase it will try to match this rule (even if rule content can match empty string).

Currently, an only possible way how to achieve correct work (from what I see) is - ignore ws during tree creation.

Overall - I don't have any final working solution. But I hope that it can be found.

Mats-SX commented 8 years ago

Hey @FylmTM

Right now there's no built-in checking of known correct queries to the grammar. This PR: https://github.com/opencypher/openCypher/pull/57 adds a system to do that kind of checking, and it'll be merged soon.

Previously, the grammar had so many whitespace bugs that it wasn't feasible to have an automated system; it would just break all the time. Instead we went with exploratory testing, and I didn't keep track of the queries I used for that. I hope to redeem these efforts by the above addition, to ensure that any subsequent changes aren't causing regressions.

Thanks for sharing your queries. I'll copy them and add to the (so far rather small) list of queries that the parser must parse correctly, if you don't mind?

I'm interpreting all your points above as bug reports. I confirmed that 1) is caused by a bug in the functionInvocation rule, which does not declare that whitespace is an option between arguments of the function (after comma). I'm fairly certain the others are similar bugs.

Parboiled and Antlr are different, you're right. The situation we have, however, is that the only correct parser for Cypher that exists today is a Parboiled parser implementation. We want to achieve an implementation-agnostic grammar specification, but we aren't quite there yet. But given that goal, I think it is a good idea to have these different parser implementation systems around, because it forces our grammar to a position that is not geared towards a certain parsing idiom, but one that is more general.

Don't get me wrong; I definitely acknowledge that removing whitespace parsing from the Antlr grammar would probably make for a better Antlr parser. It's just that our goal is not to provide a good Antlr parser, it's to provide a solid and correct Cypher grammar. If you and/or others benefit from a good Antlr parser for Cypher, that's great, and I think your insights are valuable to any and all who are looking at using an Antlr parser for Cypher. If you'd like, you're very welcome to issue a PR to this repository that adds an additional grammar generator for Antlr, which does the channel-hidden thing for whitespace.

EDIT: I should note that I plan on fixing those whitespace bugs you reported right away.