qwertie / LoycCore

The Loyc Core Libraries. Loyc.Essentials fills in gaps in the .NET Base Class Library; Loyc.Collections adds sophisticated data structures; and Loyc.Syntax contains the LES parser and other parsing-related services including base classes for LLLPG.

Reduce whitespace sensitivity of LES #3

Open qwertie opened 9 years ago

qwertie commented 9 years ago

It occurs to me that some people might not like the whitespace sensitivity of LES. I'm not really a big fan myself - I'm concerned that the parsing rules are too complicated and that people won't "get" them.

The core problem is about superexpressions and newlines. If the user writes

cat += mew
Foo(); // two statements? or one statement meaning `cat += mew(Foo())`?

they probably just forgot a semicolon, and two statements were intended. On the other hand, in

cat += new Foo();

clearly new Foo() is intended to be a superexpression. This implies we need a whitespace-aware parser to detect missing semicolons, lest the missing semicolon in the first example go undetected. The parser needs special logic to see the line break, notice that the second line is not indented relative to the first line, print a warning, and assume that the user intended a semicolon (but forgot).

But this logic means that (outside any other braced block or brackets),

class Foo
{
};

is treated as two separate statements, even though the user no doubt intended to write a single statement. Therefore LES currently allows a start-of-line colon to indicate "no, I really meant to continue this statement on the next line":

class Foo
:{
};

This is inconvenient for the user, and supporting the colon currently involves some tricky logic between the lexer and the parser. Can we avoid these difficulties?

I've given it some thought, and I think I have a solution that eliminates most of LES's whitespace sensitivity, and allows us to eliminate all the logic around line breaks (except that I'd still like to allow parsers to support Python mode). It breaks backward compatibility, but I think it will be easier for language pundits to accept. My proposal is:

  1. The lexer should look for a whitespace character at the end of every identifier. If present, the token type is IdSpace, otherwise it's Id. This will be used to detect superexpressions, replacing the old approach, in which the parser compared the end position of one token against the start position of the next.
  2. In a statement-parsing context,
    • a call in prefix notation is not permitted as the target of a superexpression, i.e. foo(x) expr will normally be treated as a syntax error, and foo(x) (expr) is treated as a double-call equivalent to foo(x)(expr) (without the space).
    • outside brackets, a superexpression can only begin at the very beginning of the statement. x = new Foo(); will be a syntax error. Write x = (new Foo()); or x = new(Foo()); or even x =newFoo(); instead.
    • a comma is permitted as a separator between any two expressions in a superexpression at the statement level.
    • if an expression in the superexpression begins with a braced block, then the braced block is the entire expression, and the closing brace counts as the end of the statement, unless the closing brace is followed by a comma (,), which is consumed. Note that this rule applies only if the braced block appears at the beginning of an expression. Thus get { return x; } set { x = value; } counts as two statements, but if x > {y} then { Foo(); } counts as a single statement. Temporarily, a warning will be printed if the closing brace isn't followed by a semicolon, colon, comma, closer, or EOF, while we change old LES code to reflect the new rules.
    • the new lexer rule (1) means comments can subtly alter the meaning: foo/**/(x) would now be parsed as a normal function call instead of a superexpression.
  3. Inside parentheses or square brackets, comma-separation is no longer allowed (since comma separates items in an argument list or tuple) and the superexpression target need not be the first thing in the expression.
    • a call in prefix notation is not permitted as the target of a superexpression.
  4. Colon will no longer be recognized as a way to continue a statement after a line break, but colon will be allowed as a way of asking for a superexpression to continue after a braced block, in both contexts in which superexpressions can appear, e.g. if x > 0 {...}: else {...}. Colon in this context has the same effect as a comma at the statement level.
    • In other circumstances, colon is recognized as a binary operator, e.g. x: int. For now, colon will no longer be recognized as a prefix operator (though derivatives of colon, like |:, can still be used as prefix operators).
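To make rule 1 concrete, here is a rough sketch of the Id/IdSpace split, written in Python rather than as part of the actual Loyc.Syntax lexer. The token pattern, the function name lex, and the catch-all "Other" token type are all made up for illustration; only the Id and IdSpace names come from the proposal above.

```python
import re
from typing import List, Tuple

# Hypothetical, heavily simplified token pattern (NOT the real LES lexer):
# comments, whitespace, identifiers, or any other single character.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<ws>\s+)
  | (?P<id>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<other>.)
""", re.VERBOSE | re.DOTALL)

def lex(source: str) -> List[Tuple[str, str]]:
    """Return (token_type, text) pairs, skipping whitespace and comments."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind in ("ws", "comment"):
            continue
        if kind == "id":
            # Rule 1: look at the single character right after the identifier.
            # Whitespace there makes it IdSpace; anything else (including a
            # comment or end of input) leaves it as a plain Id.
            next_char = source[m.end():m.end() + 1]
            kind = "IdSpace" if next_char.isspace() else "Id"
            tokens.append((kind, m.group()))
        else:
            tokens.append(("Other", m.group()))
    return tokens

print(lex("foo (x)"))      # foo is followed by a space
print(lex("foo(x)"))       # foo is followed directly by "("
print(lex("foo/**/(x)"))   # foo is followed by a comment, not whitespace
```

In this sketch the decision is made entirely inside the lexer, so a parser built on top of it would never need to compare token positions. It also shows the comment caveat from rule 2: in foo/**/(x) the character after foo is a slash, so foo comes out as Id.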

Under these new rules, problem cases like these...

cat = 1 + mew
Foo();

Foo()
Bar();

are syntax errors. Meanwhile,

if { Foo(); } else Bar();

is parsed as two separate statements, with a warning; to treat them as one statement, one must write

if { Foo(); }, else Bar();

I still see one potential problem case: code like

var x = 0
Foo()

will be parsed as a single statement, which may be unintentional. If this is a serious concern, a bit of extra logic could produce a warning in this situation (heuristic: there's no semicolon, and the second line is neither indented nor begins with an opening brace).
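A rough line-based sketch of that heuristic, in Python and outside any real parser, might look like the following. The function name suspicious_continuations is hypothetical. The sketch also assumes, beyond what is stated above, that a previous line ending in `{` or `}` does not leave a statement open, and it makes no attempt to handle comments or string literals.

```python
from typing import List

def suspicious_continuations(source: str) -> List[int]:
    """Return 1-based numbers of lines that may be unintended continuations."""
    flagged = []
    prev = None  # most recent non-blank line, with its indentation intact
    for number, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines do not affect the heuristic
        if prev is not None:
            prev_indent = len(prev) - len(prev.lstrip())
            indent = len(line) - len(line.lstrip())
            # Assumption: ";" ends a statement, and a line ending in a brace
            # is not treated as leaving a statement open.
            statement_open = not prev.rstrip().endswith((";", "{", "}"))
            indented = indent > prev_indent
            starts_with_brace = line.lstrip().startswith("{")
            if statement_open and not indented and not starts_with_brace:
                flagged.append(number)
        prev = line
    return flagged

print(suspicious_continuations("var x = 0\nFoo()\n"))         # flags line 2
print(suspicious_continuations("var x = 0\n    + Foo();\n"))  # indented: no warning
print(suspicious_continuations("class Foo\n{\n};\n"))         # brace on next line: no warning
```

A real implementation would presumably work on tokens inside the parser rather than on raw lines, but the three conditions are the same ones listed in the heuristic.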