Open jmarton opened 6 years ago
Note: related issues were discussed earlier in #56. Essentially, handling whitespaces is very easy in CF grammar-based parsers such as plain ANTLR and Xtext (which uses ANTLR 3 internally).
However, it is not trivial in PEG-based tools such as Parboiled. The Handling Whitespace wikipage of the parboiled project says this:
One disadvantage of PEGs over lexer based CFGs can be the handling of white space. In a traditional CFG based parser with a separate lexer (scanner) phase this lexer might simply skip all white space and only generate tokens for the actual parser to operate on. This can free the actual parser grammar from all white space treatment. Since PEGs do not have a lexer but directly operate on the raw input they have to deal with white space in the grammar itself. Language designers with little experience in PEGs can sometime be unsure of how to best handle white space in their grammar.
Thank you @szarnyasg for the reference and raising the point of PEGs.
However, PEGs usually do a greedy matching, so keeping whitespaces in the grammar and dropping (hiding) them in the ANTLR4 artifact:
SP* SP*
, its first part will consume all whitespaces<repeat min="1">
<non-terminal ref="ReadingClause"/> &WS;
</repeat>
I think this makes a lot of sense, and will certainly simplify the grammar as a whole. I don't think I can foresee what exact consequences (if any) this will have for the language, but I can't come up with any real problem, if the rule was formulated that whitespace is allowed between tokens in any position of the query.
Some examples that I've considered that don't seem to be a problem:
RETURN toInt ( '1' )
RETURN 1 IS NOT NULL
MATCH ( a : A ) - [ r : TYPE | TYPE2 ] - > ( b : B { b : 1 } )
(EDIT: GitHub unhelpfully renders my multi-whitespace separations as single-whitespace in the above. You'll have to imagine wider separation, although it doesn't alter the point really.)
As correctly identified by @jmarton in the above, the specification for whitespace as currently done is to model this very rule, but it turns out it's pretty tricky to get it right.
@thobe Do you have any input for this discussion?
Looks like I'm seeing this quite late.
I think what we should do is add a classification of non-terminals whether they are expected to be handled on a lexer level or on a parser level. A processing tool can then use that information to inject whitespace in appropriate places for output formats that needs to be explicit about it.
I agree, and we already have started such a scheme:
Yes. I'm not sure I'm a fan of the particular naming, but it does the trick. "All" we need to do is use that for some sort of "whitespace injection", and we're set.
Currently, the openCypher grammar has whitespaces modeled like the snippet below describing that
ReadPart
is a sequence of zero or moreReadingClause
separated by whitespaces.https://github.com/opencypher/openCypher/blob/8fa5658e433060a208971845a2b301d140461535/grammar/cypher.xml#L151-L155
To be more precise, the excerpt above also states that a
ReadPart
ends with a whitespace unless empty.However, the next excerpt from
ReadUpdateEnd
is formalized more strictly to describe a sequence of one ore more ofReadingClause
s that have whitespace separators strictly in between them, but not at the end.https://github.com/opencypher/openCypher/blob/8fa5658e433060a208971845a2b301d140461535/grammar/cypher.xml#L111-L114
The reason behind the difference between the two excerpts is to make the grammar unambiguous also for whitespace matches.
If we could switch to whitespace-independent grammar, then the excerpt above could be rewritten in a more straightforward way like:
I think it would be beneficial to switch to whitespace-independent formalization of the grammar. Also the patches in #300 could be rewritten in a more intuitive way then.
ReadingClause
sequence ofReadUpdateEnd
in its natural form despite the grammar says it is one followed by zero or more.What do you think?