Ambiguity in Rascal grammar in an expression with both `^` and `$` does not arise from a semantic ambiguity

usethesource / rascal

The implementation of the Rascal meta-programming language (including interpreter, type checker, parser generator, compiler and JVM based run-time system)

http://www.rascal-mpl.org

Other

408 stars 77 forks source link

Ambiguity in Rascal grammar in an expression with both `^` and `$` does not arise from a semantic ambiguity #1200

Open eh-adblockplus-org opened 6 years ago

eh-adblockplus-org commented 6 years ago

The following grammar fails to compile. WholeLine fails to compile because it triggers an ambiguity in the Rascal grammar, which then has no case statement to handle the ambiguity node in sym2symbol, whose default throws an exception that terminates compilation.

module ssce
lexical WholeLine = ^ Contents $ ;
lexical WholeLineWorkaround = (^ Contents ()) $ ;
lexical Contents = ![\r\n]+ ;

The grammar rules that provided the opportunity for ambiguity are these, from Rascal.rsc:

Syntax Sym
[...]
    | endOfLine: Sym symbol "$" 
    | startOfLine: "^" Sym symbol

There is no associativity declared for these productions. Because they're non-associative, you can't put both of them on the same symbol without causing an ambiguity. The semantics of WholeLine, however, are unambiguous. The Rascal grammar here is simply counter-intuitive.

I was working inside Eclipse (0.9 stable, although I first had the problem with the unstable branch), there was a compilation error that was silent. The problem was showing up only in the Eclipse window Error Log, which is not part of the default Rascal perspective. I only saw the compilation error after I made a fresh Eclipse installation and loaded the project before the Rascal perspective took over.

Furthermore, this error was not showing up as a syntax error in the editing window. Apparently there's some mismatch in behavior between the IDE grammar and the compiler grammar for this case.

The ordinary workaround non-associative operators is to use parentheses to make the association explicit. This doesn't work in the current grammar, though, because of the way that parentheses-terms work.

Syntax Sym
[...]
    | sequence: "(" Sym first Sym+ sequence ")"

Parentheses only apply to lists of two or more symbols and do not apply to single symbol. Thus the workaround has extra parentheses as an empty symbol just to get a second symbol for the disambiguating parentheses. It's not clear to me why there's no single symbol production for parentheses, but there's not, and it turns the obvious workaround into something more complicated.

This one issue report contains more than one problem, admittedly. Here's a list of the things to do:

Semantically speaking, ^ and $ are just special versions of precede and follow rules, which do have associativity defined. Perhaps the easiest fix is to move the productions for ^ and $ in with those of precede and follow. If that's not right for some reason, they could just get their own associativity group.
Ensure that there's a production for "single-symbol sequences". These are simple disambiguating parentheses. Either add another production for it or allow existing sequences to contain only one element.
Ensure that ambiguity errors that arise from the Rascal grammar itself appears as syntax errors in the IDE. Given the code I've seen, I'd guess that the lack of reporting might be generic to all ambiguity errors. For example, there are the same symptoms with this production: lexical AnotherLine = ^ Contents ? ;
Add the Eclipse Error Log to the Rascal perspective. It will expose errors to the user that would otherwise be masked.

speakeasy commented 6 years ago

Ensure that ambiguity errors that arise from the Rascal grammar itself appears as syntax errors in the IDE. Given the code I've seen, I'd guess that the lack of reporting might be generic to all ambiguity errors. For example, there are the same symptoms with this production: lexical AnotherLine = ^ Contents ? ;

I think there may be a logging level for it but I haven't seen this issue come up in a while and I don't remember if or where you might find it in the documentation. You may try to look at the logging method as well..

On Tue, Jul 17, 2018, 10:09 PM Eric notifications@github.com wrote:

The following grammar fails to compile. WholeLine fails to compile because it triggers an ambiguity in the Rascal grammar, which then has no case statement to handle the ambiguity node in sym2symbol, whose default throws an exception that terminates compilation.

module ssce lexical WholeLine = ^ Contents $ ; lexical WholeLineWorkaround = (^ Contents ()) $ ; lexical Contents = ![\r\n]+ ;

The grammar rules that provided the opportunity for ambiguity are these, from Rascal.rsc:

Syntax Sym [...] | endOfLine: Sym symbol "$" | startOfLine: "^" Sym symbol

There is no associativity declared for these productions. Because they're non-associative, you can't put both of them on the same symbol without causing an ambiguity. The semantics of WholeLine, however, are unambiguous. The Rascal grammar here is simply counter-intuitive.

I was working inside Eclipse (0.9 stable, although I first had the problem with the unstable branch), there was a compilation error that was silent. The problem was showing up only in the Eclipse window Error Log, which is not part of the default Rascal perspective. I only saw the compilation error after I made a fresh Eclipse installation and loaded the project before the Rascal perspective took over.

Furthermore, this error was not showing up as a syntax error in the editing window. Apparently there's some mismatch in behavior between the IDE grammar and the compiler grammar for this case.

The ordinary workaround non-associative operators is to use parentheses to make the association explicit. This doesn't work in the current grammar, though, because of the way that parentheses-terms work.

Syntax Sym [...] | sequence: "(" Sym first Sym+ sequence ")"

Parentheses only apply to lists of two or more symbols and do not apply to single symbol. Thus the workaround has extra parentheses as an empty symbol just to get a second symbol for the disambiguating parentheses. It's not clear to me why there's no single symbol production for parentheses, but there's not, and it turns the obvious workaround into something more complicated.

This one issue report contains more than one problem, admittedly. Here's a list of the things to do:

Semantically speaking, ^ and $ are just special versions of precede and follow rules, which do have associativity defined. Perhaps the easiest fix is to move the productions for ^ and $ in with those of precede and follow. If that's not right for some reason, they could just get their own associativity group.

Ensure that there's a production for "single-symbol sequences". These are simple disambiguating parentheses. Either add another production for it or allow existing sequences to contain only one element.

Ensure that ambiguity errors that arise from the Rascal grammar itself appears as syntax errors in the IDE. Given the code I've seen, I'd guess that the lack of reporting might be generic to all ambiguity errors. For example, there are the same symptoms with this production: lexical AnotherLine = ^ Contents ? ;

Add the Eclipse Error Log to the Rascal perspective. It will expose errors to the user that would otherwise be masked.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/usethesource/rascal/issues/1200, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfD_49_0D_ih-Wo00W9QNakuNQ2rBflks5uHpi-gaJpZM4VT07P .

DavyLandman commented 6 years ago

Thanks for your report, we should indeed look at removing this ambiguity from the rascal grammar. (there are a few more sadly).

Error reporting

I completly agree, in our current setup we do not report all errors of imports. I would have to look into why again, but there is a technical constraint at the moment. And we try to have as a rule that the error log should only be for internal errors that should never be shown to the user. This is obviously not one of those, so it should be reported in a user view.

Importing the module with the grammar in the REPL does tend to show the error message. But then you would have had to know that that was the faulty module to begin with.

Our eclipse editor used color ambiguous parts red, it seems we have now chosen for just pick one of the alternatives and use that for highlighting. I'll look into this code to see if we can do both, keep the highlighting, but also report an error. (although I fear this is an Eclipse thing, where inside the token colloring code it's hard to report an error, but I will look into it, so that we get earlier feedback). I myself was also bitten by this just yesterday (a different ambiguity).

Workaround

You are right, for most cases you could use precede and follow restrictions, however they don't correctly work together with begining of file/stream and end of file/stream. So your workaround with the empy symbol is indeed the option I would go for as well. I remember there was a specific reason for enforcing that the brackets can only be used to group more than one symbol, I think @jurgenvinju can give that context.

module GrammarTest

import ParseTree;

lexical A1 = ^ (() "a" $);

lexical A2 = [\n\r] << "a" >> [\r\n];

test bool a1() = appl(_,_) := parse(#A1, "a");
test bool a2() = appl(_,_) := parse(#A2, "a");

running:

rascal>import GrammarTest;
ok
rascal>:test
Running tests for GrammarTest
| testing 0/2 | testing 0/2 error: a2 @ |project://testing/src/GrammarTest.rsc|(162,46,<10,0>,<10,46>)
        |std:///ParseTree.rsc|:400,0: ParseError(|unknown:///|(0,1,<1,0>,<1,1>))
        at *** somewhere ***(|std:///ParseTree.rsc|(13083,1963,<400,0>,<446,114>))
        at parse(|std:///ParseTree.rsc|(15039,5,<446,107>,<446,112>))
        at $root$(|main://$root$|)
Test report for GrammarTest                                                 
        1/2 tests succeeded
        0/2 tests failed
        1/2 tests threw exceptions
bool: false
rascal>

Implementation detail

A small warning, the $ matches the \n only, no matter which operating system you are running in. This in general works, as both windows and linux (+osx) newlines all end with the \n.

But the parser you defined will fail on this input: "xx\r\n". as the Contents non-terminal will stop parsing at \r, but the $ won't match the \r.

For the lang::std::Comment I introduced this mistake in 2012: f8a011de01f329b15768b5d9a239ca0af25dd1b6 and only recently I found this was breaking our DSLs (those that chose to import the available definitions) in Windows, and I undid my "fix": e30cb711b7f18c03882e718dfa10b9fed7477eec.. (and a bit later, the same tool that cause my earlier mistake motivated @jurgenvinju to again add the \r to the character class ( 8dccee2960b84cb74af08d5755567770c7be4d51 ) and after our CI showed how it failed, revert that: 9fd91bca8df2bb97b991edc56eb4a8367f196552 )

eh-adblockplus-org commented 6 years ago

Thanks for mentioning that import was involved in the error being silent. That was my case as well, but I neglected to mention in my report.

I recommend showing the Error Log not as a point of principle, but purely one of practicality. At the current state of maturity, it's better to err on the side of letting the user see too much than too little. At some point in the system's maturity, the Error Log would just become UI cruft that would never be used. But that's not the state of things today.

Line endings are indeed difficult to get right. Having $ match \n only, while expedient, isn't really correct, since, as you point out, on Windows it leaves \r to be matched by another pattern. This isn't an issue with languages that use whitespace as layout, the typical case for programming languages, but I'm currently working with one where that's not the case. We don't really have to deal much with (original) Mac OS much more, but I know from personal experience (because I had to do a rewrite) that the SMTP specification allows all of LF, CR LF, and CR as valid end-of-line markers. And this problem isn't going away. There's now U+2028 LINE SEPARATOR, LS, for example, and that's not the only code point that acts as an end-of-line marker.

jurgenvinju commented 6 years ago

Because the brackets introduce another level in the tree, i.e. (A B) is desugared as syntax (A B) = A B, I avoided another interpretation of ( Sym ) as ignorable brackets. It's still possible to add this without introducing more ambigity, so I'd like your opinion (both) on that design choice. I.e. adding the rule syntax Sym = bracket "(" Sym ")"; next to the sequence combinator.

The same brackets, meta-syntactically, would mean something different. I thought that would be confusing.

Since non-terminal juxtapositioning is assocative, the brackets would never have any semantics I believe; their only use would be to disambiguate our own implementation as a workaround for the current bug, and perhaps to clarify a complex rule to the reader of a grammar. However, if such a rule exists that it needs brackets, perhaps to best way would be to introduce another named rule instead?

jurgenvinju commented 6 years ago

indeed "must-follow" rules will fail at EOF and at BOF; that's still a point to ponder. We thought about introducing a fictional character notation for EOF and BOF, such that people could write >> [\r\n\EOF]; that does not win a design price since the character really does not exist, and the character class algebra would have to be reconsidered with those funny characters in play.

I like the ^ and $ notation better for such things, i.e. additional constraints rather than magic characters.

jurgenvinju commented 6 years ago

About the OS dependent meaning of ^ and $, I'd like grammars to not have to encode these differences, like CSS should work the same for every browser.

However, so far no extremely bright solution has descended from above :-) Preprocessing every file towards a universal "rascal" normal-form (let's say the Uninx way) is not good, since it introduces "noise" and distances parse trees from the characters which an editor sees. We'd be shifting the problem from the parser to the bridges to the editors we use. Not a real solution.

Another solution direction might be to give ^ and $ different run-time semantics when they are executed on different operating systems. I'm leaning towards this solution, but it has some nasty consequences that I cannot oversee.

Namely, this would also have to include that [\n] would mean ([\r][\n]) on windows, and [\n] on linux, and [\r] on legacy macs, etc, for consistency's sake. Right? Or [\n] should mean [\r\n] on Windows? That would make things easier (see below), but does it make any sense? Can we give ^ and $ a OS-dependent semantics without giving \n an OS dependent semantics? That sounds inconsistent...
One issue that comes with this solution direction is when parsing files that originate from a different OS, first they must be dos2unixed... Ok, that can be collatoral damage IMHO, something we could accept.
The good thing is that Rascal-generated parsers would still be lossless and not noisy
Back-end code would still have to deal with the different contents (and possibly different shapes) of parse trees if they would analyze them to the lexical level, depending on how we would define \r\n.
I've no idea yet how to fix character class algebra in this sense. I.e. should [\n] + [a-z] (character class union) mean [\na-z] on unix and (([\n] [\r]) | [a-z]) on Windows? or should it be [\n\ra-z] on Windows? sequence the \r or add it disjunctively? Same for cc intersection and cc difference a semantics must be sought which is consistent and meaningful.

Again, your thoughts, both, on this would help perhaps to clarify this the ins and outs. This is a minefield.

DavyLandman commented 6 years ago

what if $ matches either [\r] !>> [\n], [\r][\n], or [\r]!<<[\n] or the new unicode ones?

then the only thing that will be messed up is inconsistent new-line character usage. so if you have a file that sometimes has a \r\r\n. How many lines is that?

I think that $ (and our other character classes) should not be OS-dependent. I do not want to have more errors with Windows/Linux differences.

jurgenvinju commented 6 years ago

in unix an accidental \r\n would then be parsed as-if on Windows. otherwise I think it would work. The issue I see is that $ would be OS independent while the character classes containing \r and \n still wouldn't be. This may be the right trade-off.

jurgenvinju commented 6 years ago

(from wikipedia)

jurgenvinju commented 6 years ago

it would be great if we could map CR+LF during character reading into a single Unicode-24 character, without loss of information, and also have a special escape character for that particular character. That would solve a lot of the above issues. I just don't see how yet.

Could we use an index higher than the 24-bit range? We have 32-bits after all. When writing that character again, it should be decoded back of course into the proper UTF-8 or whatever encoding was chosen. There's probably downsides to this as well.

PS: I'm going offline for three weeks now. See you after that!

eh-adblockplus-org commented 6 years ago

On parentheses.

Unfortunately for formal language designers, mathematicians long ago overloaded the usage of parentheses:

As a grouping element within an expression, with no particular meaning other than to define and clarify the order of operations.
As a delimiter for function evaluation, marking the expressions to be used as values of the arguments.
As a delimiter for function definition, marking the formal variables bound within an expression denoting the value of the function.

Computer language authors have also broadly used parentheses as part of the syntax of control flow statements.

I'm somewhat of a pragmatic traditionalist about parentheses in expressions. Traditional, because I think parentheses should never be required unless you need them because of operator precedence and never be prohibited because they're extraneous. Pragmatic, because this is the way all mathematics uses parentheses and that's a huge ship I'd never try to steer against.

Rascal is non-traditional in a few ways. Sym productions alternative and sequence both require parentheses at present, though neither is strictly necessary. Rascal also does not have "use-at-will" grouping parentheses. If it were me doing this from scratch, I'd have exactly one production with parenthesis: `"(" Sym? ")". Having said this, I don't know what might break if this change were made.

I'd likely also remove mandatory parentheses from control flow statements where the parentheses surround single expressions, but maybe that's just me.

jurgenvinju commented 6 years ago

ok thanks! the brackets around |... | nested alternatives are necessary for disambiguating with the normal outermost | disjunctive alternatives.

The reason for the non-ignorable brackets around sequence (A B C) are more subtle and complex (and also present for |, just that ambiguity is not really an issue there if we'd added normal ignorable brackets).

The main underlying principle is that every non-terminal in Rascal's grammars are types in its type system, so non-terminal constructors like |, ? and sequence, lists, etc. are also type constructors. Rascal does not have type coercion or sub-typing (for other reasons), so every one of these type constructors must leads to a new level in a parse tree. (A B) is the type of all parse trees which have been parsed using the non-terminal (A B) which is (automatically) defined by the rule syntax (A B) = A B.

The second principle is that the concrete syntax of symbols is very close to their abstract syntax (which is also their internal type representation), so a rule for Sym in the Rascal grammar corresponds directly to a rule for the abstract data type Symbol in the reified type representation.

If we would make the brackets less consequential, as they'd normally be in other expression languages, we could still generate the same parsers (I think), but we wouldn't have a proper types for parts of parse trees generated by regular expressions over non-terminals. For example, the optional sequence ("else" Stat*)? would not have a consistently typed representation. Also we'd loose the simple mapping from meta-syntax to reified type representation.

The current design for regular expressions over non-terminals is completely low brow and simple: every one of the combinators generates both a type and a set of productions to derive sentences of that type. The parentheses seems to be the only real problematic issue, and its mostly a meta-syntax design choice. Perhaps a different syntax for nested sequence, not overloading the parentheses, would be a good idea? I'm open to suggestions.

Or, perhaps indeed interpreting introducing syntax Sym = "(" Sym* ")", and then interpret(A B)asseq([A, B])while(A)becomes simplyAand()becomesempty, and(A | B |C)becomesalt({A,B,C})`, The mapping becomes complex, and possibly unstable. It's hard to oversee the consequences, but perhaps worth a try on a branch and some experimentation to see if we could get such a direction "right". I'm more for a simpler solution with fewer unknown consequences :-) I like the internal simplicity of the current design, but indeed not what it does to the semantics of brackets.

jurgenvinju commented 6 years ago

ok, Jurgen out, now I'm really gone fishing for three weeks :-) Hope to continue this after, unless you guys have moved on by then! Cheers.

eh-adblockplus-org commented 6 years ago

every non-terminal in Rascal's grammars are types in its type system

The goal, then is preserve this and also get type-transparency with respect to parentheses. That means, almost of necessity, that you don't want to operate on the syntax tree of the original definition because, by assumption, it has parentheses. The solution to me seems very Rascally: transform the tree by collapsing all the parentheses and define the types on a canonical tree. You've already got an implode() that does something in a related spirit. Define canonize() that gives the type-defining tree.

Canonical type trees could, with more effort and new language constructs, do more than just parentheses. But that's getting far afield from just dealing with semantically-neutral variation in the syntax of definition.

jurgenvinju commented 6 years ago

Yes, the current implementation is already quite close to that. I'll have a look later this summer if/how we can follow this through. Thanks for the discussion and suggestions. It helps.

jurgenvinju commented 6 years ago

See https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/rascal/grammar/definition/Symbols.rsc which implodes parse trees of Sym to the Symbol datatype. That's the code we are talking about. It avoids concrete syntax in pattern matching for bootstrapping reasons.

DavyLandman commented 6 years ago

Regarding the newlines, this is why I think my earlier proposal is a good tradeoff.

I think our grammars should work the same on any OS. If only for the fact that we might use them to parse files not written in the same OS as it is currently running in.
Our character class expression generate a parser for a single Unicode code-point, that should remain the case. I do not think we need to do some weird automatic expansion of \n to all variants of newlines.
Neither should we automatically convert the input to normalize newlines, that just leads to confusions, and there are formats where a certain style of newlines are required, and all others are invalid. (For example both HTTP and CSV enforce \r\n. Also, this will seriously complicate our source locations.
The ^ and $ are special, they describe a precede and follow like precondition. Since they are not consuming the characters they match, I think it would be the cleanest to only extend their functionality. If a user wants to indeed match multiple kinds of newlines, they will have to write a lexical specifically with these options.
We might offer a standard lexical definition that people can reuse for newlines:

lexical LineTerminator
     = lineFeed: [\r] !<< [\n]
     | verticalTab: [\a0B]
     | formFeed: [\a0C]
     | carriageReturn: [\r] !>> [\n]
     | windowsNewline: [\r][\n]
     | nextLine: [\u0085]
     | lineSeparator: [\u2028]
     | paragraphSeparator: [\u2029]
    ;

jurgenvinju commented 6 years ago

Ok. It's the best trade-off for now. The lexical Library is good for documentation. Later the alias feature should allow us to substitute names for character classes and also compute unions and intersections and such..

eh-adblockplus-org commented 6 years ago

Unicode has a definition for line boundaries: Unicode Technical Standard #18, Unicode Regular Expressions: 1.6 Line Boundaries. It's in the context of regular expressions, a bit different than Rascal grammars, but they do address all the issues raised here, including proper definitions for ^ and $. While they don't mention it explicitly, the authors don't mention anything like platform-dependent behavior.

There is a definition of a "newline sequence" that is basically identical to LineTerminator. They also mention this: (FYI -- Java 8 regex now has \R.)

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above

Rascal isn't a regex language, so I can't advocate for \R as used here, but it is the moral equivalent of what we've been discussing here.

It's worth noting that by calling the section "Line Boundary", they're remaining agnostic about the semantics of line counting, whether the boundary represents a line terminator or a line separator.

There's also an interesting note about the semantics of regex . "arbitrary character pattern".

Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u{D A} (CRLF) should match as if it were a single character.

An analogous facility would be useful in Rascal, particularly to be able to subtract character classes from . and have line boundary behavior be transparent. That does mean, though, manipulating simple regex-complexity grammars and not simply character classes. Also on this point, defining "any character that's not a line boundary" is a character class, while "line boundary" is a tiny grammar. Thus, if there's a language-provided symbol for line boundaries, there probably also ought to be one for the complement of a line boundary.

eh-adblockplus-org commented 6 years ago

There is one usage of \R I can recommend: in the documentation for multi-line string values. I am presuming that Rascal source code will use the line boundaries discussed herein, which seems not unreasonable, but is a separate change from allowing grammar authors to use these line boundaries.