The approach I took is to use an ambiguity checker.
There are not many tools that do this, but I used this one: https://github.com/stepchowfun/cfg-checker.
It only accepts grammars in BNF form. Ours is in EBNF form, so I used gtools to convert it. The result is this: https://gist.github.com/affo/bde5b78bfdecbfe6901965920e1018b6.
I gave this file as input to cfg-checker, and it couldn't find any ambiguity in ~20 minutes of running.
About #1102: are we sure that is an ambiguity? A bug in the parser doesn't mean that the grammar is ambiguous; it probably means that the parser misunderstands something. Having an ambiguous grammar means that you can generate the same sentence with more than one distinct set of productions. Is that true here?
We said that if we implemented our parser with lookahead 3, we could fix that issue. That would mean that changing the parser implementation fixes an ambiguity in the grammar, which is impossible: a grammar is either ambiguous or it is not, no matter what parses its sentences.
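For a concrete illustration of what ambiguity means (a textbook grammar, not part of Flux):

    Expr := Expr "-" Expr | number

The sentence 1 - 2 - 3 can be produced by two distinct derivations, corresponding to the parse trees (1 - 2) - 3 and 1 - (2 - 3), so this grammar is ambiguous no matter which parser consumes it.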
There are at least two possible issues at play here:
The first step is to create issues to answer some of these questions.
Article by Terence Parr (the ANTLR guy) arguing in favor of LL(k>1) and LR(k>1) grammars: https://www.antlr2.org/article/needlook.html (I have only read the abstract)
Here's a summary of @affo 's and my findings:
Our grammar does have some ambiguities stemming from the fact that we allow a syntactic shortcut for properties, that is:

    // function call:
    f = (a) => a * a
    a = 3
    f(a: a) // produces 9
    f(a)    // legal; equivalent to the above

    // likewise for objects:
    o = {a} // legal, short for {a: a}
    o       // produces {a: 3}
This leads to ambiguities: for the Flux script

    f(a)

there are two valid parse trees. It could either be a function call:
    &ast.File{
        Body: []ast.Statement{
            &ast.ExpressionStatement{
                Expression: &ast.CallExpression{
                    Callee: &ast.Identifier{Name: "f"},
                    Arguments: []ast.Expression{
                        &ast.ObjectExpression{
                            Properties: []*ast.Property{
                                {Key: &ast.Identifier{Name: "a"}},
                            },
                        },
                    },
                },
            },
        },
    }
Or it could be two consecutive expression statements:
    &ast.File{
        Body: []ast.Statement{
            &ast.ExpressionStatement{
                Expression: &ast.Identifier{Name: "f"},
            },
            &ast.ExpressionStatement{
                Expression: &ast.Identifier{Name: "a"},
            },
        },
    }
Do we want to continue to support the short form for properties? There is a case for the answer being no.
Note however that there could be other ambiguities in the grammar that we don't yet know about. cfg-checker
did not find the one described above.
Historically, LL(k>1) grammars were considered "bad" because automatically generated parsers required space on the order of O(|T|^k), where |T| is the number of tokens and k is the amount of lookahead. This paper discusses this and also provides some helpful background on parsing strategies in general: https://www.antlr2.org/article/needlook.html
My takeaway from this article is that since our parser is hand-coded, we can avoid the O(|T|^k) space penalty. That is, we can restrict lookahead to the minimum amount needed in any given context. This is effectively an LL(*) parsing strategy, which is what ANTLR uses: https://www.antlr.org/papers/LL-star-PLDI11.pdf
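As a sketch of what that could look like in a hand-written parser (names here are illustrative, not the actual flux parser API): buffer tokens lazily, so a decision that needs k tokens only ever materializes k tokens, instead of paying the O(|T|^k) table cost of a generated LL(k) parser.

    package parser

    // Illustrative token and scanner types, not the actual flux API.
    type Token struct {
        Kind string // e.g. "IDENT", "LPAREN", "OR"
        Lit  string
    }

    type Scanner interface {
        Scan() Token
    }

    type Parser struct {
        s   Scanner
        buf []Token // tokens scanned but not yet consumed
    }

    // peek returns the nth upcoming token (1-based), scanning lazily.
    // Only decisions that need k tokens ever buffer k tokens.
    func (p *Parser) peek(n int) Token {
        for len(p.buf) < n {
            p.buf = append(p.buf, p.s.Scan())
        }
        return p.buf[n-1]
    }

    // next consumes and returns the next token.
    func (p *Parser) next() Token {
        t := p.peek(1)
        p.buf = p.buf[1:]
        return t
    }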
There are two options:
- Update our grammar and parser to prohibit the short form of properties. Note: breaking change.
- Update our parser to look ahead up to three tokens, an LL(*)-style approach, in order to differentiate calls from other expressions. This would enable our parser to recognize scripts like this (from #1102):
    not (a and b)
    (a or b) and c
The parser could look ahead three tokens after the first line to see "(", "a", "or". Once the parser sees the "or", it can deduce that the "(" does not start a function call and begin parsing a new expression statement instead.
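Using the hypothetical peek from the sketch earlier, that decision could look roughly like this (a sketch only; the token kinds are assumptions, not the actual implementation):

    // startsCall reports whether an upcoming "(" begins a call's property
    // list. For f(a or b) the parser peeks "(", "a", "or"; the "or" in
    // the third position rules out a property list, so this is not a call.
    func (p *Parser) startsCall() bool {
        if p.peek(1).Kind != "LPAREN" {
            return false
        }
        if p.peek(2).Kind == "RPAREN" {
            return true // f()
        }
        if p.peek(2).Kind != "IDENT" {
            return false // e.g. f(1 + 2) is not a property list
        }
        switch p.peek(3).Kind {
        case "COLON", "COMMA", "RPAREN":
            return true // f(a: ...), f(a, ...), f(a)
        default:
            return false // f(a or b) -> "f" then "(a or b)"
        }
    }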
Moreover, allowing a variable amount (up to three tokens) of lookahead in our parser would also allow us to stop requiring users to parenthesize the body of lambdas in calls to map:

    map(fn: (r) => {_time: r._time, _value: r._value + 1})

Our parser cannot currently handle the above, since it assumes that "{" begins a statement block. If it could look ahead as in the example above, the more intuitive way to express a lambda whose body is an object literal would work.
If we ever want to support positional parameters in addition to named parameters, moving forward with this strategy could further entrench the status quo.
While I was working on #1102, I implemented most of the required parser updates on a branch, but set it aside.
I hit some difficulty regarding how we scan for regex literals: the scanner provides both a Scan and a ScanWithRegex method, and allows callers to put just one token back, in case the parser wants to peek for a regex literal or not. It's unclear to me whether this mechanism could be extended to three tokens, and even if it is possible, the resulting code could be difficult to understand and maintain.
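For what it's worth, here is one hedged sketch of how the one-token put-back might generalize, and why it gets awkward: each buffered token has to remember which scan mode produced it, and a later peek in the other mode cannot be satisfied without re-positioning the scanner. The types reuse the illustrative Token from the earlier sketch; none of this is the actual flux scanner API.

    type mode int

    const (
        modePlain mode = iota // "/" is scanned as division
        modeRegex             // "/" may begin a regex literal
    )

    type bufferedToken struct {
        tok Token
        m   mode
    }

    type lookaheadScanner struct {
        scan          func() Token // wraps the underlying Scan
        scanWithRegex func() Token // wraps the underlying ScanWithRegex
        buf           []bufferedToken
    }

    // peek returns the nth upcoming token, scanning in mode m as needed.
    // The boolean is false when a token already buffered under the other
    // mode is requested: it may have been tokenized wrongly (DIV vs. the
    // start of a regex), and fixing that would require re-scanning.
    func (l *lookaheadScanner) peek(n int, m mode) (Token, bool) {
        for len(l.buf) < n {
            var t Token
            if m == modeRegex {
                t = l.scanWithRegex()
            } else {
                t = l.scan()
            }
            l.buf = append(l.buf, bufferedToken{t, m})
        }
        b := l.buf[n-1]
        return b.tok, b.m == m
    }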
For reference here's the Flux grammar in the format that ANTLR uses: https://gist.github.com/wolffcm/61498cd81e1140277bc0b4e9263766fd
@wolffcm crystal clear, it looks great! Thank you for the detailed summary 😄 👍
A final side note: adding semicolons (";") to our grammar would disambiguate

    f(a);

from

    f;(a);

However, that wouldn't help with

    f = (r) => {v: r._value}

That would always require three tokens of lookahead to find the ":" and understand that this is not a BlockStatement but an ObjectExpression.
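Concretely, the three-token check could mirror the call/expression sketch from earlier (hypothetical token kinds again):

    // bodyIsObject reports whether a function body starting with "{" is
    // an object literal rather than a statement block, by peeking for
    // the ":" (or the {a} shorthand) after the first key.
    func (p *Parser) bodyIsObject() bool {
        if p.peek(1).Kind != "LBRACE" {
            return false
        }
        k := p.peek(2).Kind
        if k != "IDENT" && k != "STRING" {
            return false // e.g. {return x} is a block
        }
        switch p.peek(3).Kind {
        case "COLON", "COMMA", "RBRACE":
            return true // {v: ...}, {a, b}, {_value}
        default:
            return false // e.g. {x = 1 ...} is a block
        }
    }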
@wolffcm Would having three tokens of lookahead remove the need to have Scan vs ScanWithRegex?
My thoughts:
On ScanWithRegex vs Scan, my understanding is this: we need the different scan methods because when we see a "/" token we don't know whether it is a division operator or the start of a regex. Right now we treat a regex literal / ... / as a single token. Would it be possible to treat "/" as a standalone token and use lookahead to disambiguate "/" as a binary operator from "/" as a starting regex delimiter? I see two ways this can work:
"/" means division vs. regex depending on whether we call the Scan or ScanWithRegex method, so maybe if we see "/" in the context of a regex we just keep scanning until we see another "/". Not sure if this breaks the scanner or not.

We want the shortcut syntax for calling functions; @pauldix has repeatedly asked for this, and our users consistently tell us they want these kinds of less verbose syntax.
That's fine with me. In that case, does it make sense to have similar syntax for object literals too? Just thinking about how to resolve these ambiguities...
We should interpret:
- f(a) as a function call. We already do this.
- fn = (r) => {_value} as returning {_value: _value}, rather than a statement block with a single expression statement.

If it turns out we can eliminate ScanWithRegex and only have a single Scan method on the scanner, that is another win for the maintainability of the code.
It's unfortunate that we have both ScanWithRegex and Scan, but I think having these two methods is the lesser of two evils. Making the parser start with token.DIV and scan tokens until it finds the end of a regex is really the work of a scanner, and it's not a simple task.
Consider that f(/ "foo /, / "foo /) (a function call with two regex literal arguments) would confuse the scanner absent a call to ScanWithRegex, since it would attempt to tokenize it like this:

    f  (  /  "foo /, / "  foo  /  )

where "foo /, / " is scanned as a string literal.
I think there's a solution somehow, still not sure what it is.
@wolffcm The examples look good.
Function calls with literals are not allowed, only with identifiers.
That said, your example is still valid with a small edit: f(a: / "foo /, b:/ "foo /). Tokens:

    f  (  a  :  /  "foo /, b:/ "  foo  /  )
OK, I agree that two scanner functions is the best way forward. Is there a case where we need lookahead across the two functions? Meaning, we may need to look ahead 3 on Scan and look ahead 3 on ScanWithRegex, but do we ever have to mix lookaheads across the two scan methods? Do we even need lookahead with the ScanWithRegex method? Do we have any ambiguities around regex literals?
The Scan vs ScanWithRegex question is easily solved, since we never need to scan past a "/" to disambiguate a property list from something else.
I've got a PR that implements this parsing strategy here: https://github.com/influxdata/flux/pull/1280
I think we need to represent these in the grammar or we're going to end up with more bugs. Whether or not we have lookahead is tangential to that issue.
> We should interpret fn = (r) => {_value} as returning {_value: _value}, rather than a statement block with a single expression statement.
I agree that this makes sense, but how do we do this? It seems to me like this isn't something a grammar alone can express. I do not think we can have a grammar with a bunch of hand-coded disambiguations; otherwise there will be more of these.
I'd like if we talked about the various grammar ambiguities and fixed the EBNF before making any changes to the parser. I don't think it's possible to have a conversation about lookahead until the EBNF grammar matches the proposed changes.
I've been thinking about this more and I do think semi-colons are likely the correct solution and creating rules about when they are added. The reason is mostly for usability and because I think increasing the lookahead on the parser won't solve the underlying ambiguities.
An underlying problem is that the current parser needs lookahead to determine the meaning of the following sequence of tokens:

    f(a or b)

According to the grammar, this should be two expressions, because the pattern does not match a call expression; but the parser currently sees the "(" and assumes that it is a call expression. The correct behavior according to the EBNF can only be determined when you get to the "or". I think we can all agree that this is the case and there's no ambiguity for the above.
And I think that's a problem. If I were a user and that got parsed as two expressions, I would be confused and the parser error message would probably be nonsensical. In fact, I would guess the parser wouldn't return an error.
Additionally, I think this causes a problem:
    a = from(bucket: "telegraf") |> range(start: -5m)
    f
    (a) |> yield()

In this situation, it would probably surprise the user to learn that (a) is being used as a call on the function. Maybe the call to f was unintentional, or there's something more complicated on the previous line that gets parsed as an expression. I think f() |> yield() would cause the same parser irregularity in the above as just f.
I think there's an expectation that a newline means something, even if we specify in the grammar that it means nothing. This makes me think that creating rules for when newlines matter (which is what automatic semicolon insertion is) would clear up these ambiguities and give a better user experience.
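For illustration, here is a sketch of a Go-style insertion rule (an assumption about how it could work for Flux, not a settled design): at each newline, insert a semicolon if the previous token could end a statement.

    // canEndStatement lists token kinds that may terminate a statement;
    // the exact set is a design decision, this one mirrors Go's rule.
    func canEndStatement(kind string) bool {
        switch kind {
        case "IDENT", "INT", "FLOAT", "STRING", "REGEX",
            "RPAREN", "RBRACKET", "RBRACE":
            return true
        }
        return false
    }

    // insertSemicolons turns significant newlines into SEMICOLON tokens,
    // so that "f\n(a)" becomes "f ; (a)" while "f(a)" is untouched.
    func insertSemicolons(toks []Token) []Token {
        var out []Token
        for _, t := range toks {
            if t.Kind == "NEWLINE" {
                if n := len(out); n > 0 && canEndStatement(out[n-1].Kind) {
                    out = append(out, Token{Kind: "SEMICOLON", Lit: ";"})
                }
                continue
            }
            out = append(out, t)
        }
        return out
    }

Under such a rule, f(a) stays a single call, while the split across two lines becomes two statements, matching the intuition above.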
I do like the semicolon insertion strategy, and I think you are right: newlines create expectations. Adding semicolons would eliminate the ambiguity between

    f(a)

and

    f
    (a)

And that's cool.
For the case f(a or b), wouldn't it simply error because of a missing comma? I don't think you need increased lookahead for it, and it would error as expected:

    > f(a or b)
    Error: unsupported key <nil>
Increased lookahead is required for

    f = (r) => {return x}

vs.

    f = r => {x: r.x}

to discriminate a block from an object (it is weird to ask the user to add extra parentheses).
The case

    f(r) => {_value}

is a matter of choosing whether to error or not. The ambiguities are caused by a single rule we decided to include to reduce verbosity: {a} === {a: a}. This impacts both objects and arguments to functions (as they are objects), and clearly forces us to decide on all its implications. I don't think this would lead us to introduce more ambiguities and special cases in the future.
It seems to me that {a} === {a: a} was introduced just to make arguments in a call behave like positional arguments when using the right identifiers. I don't know how useful it is for specifying objects, nor how much it is used. So, why don't we separate the arguments in a call from objects in the grammar, so that the rule applies only to function calls?
As a consequence, f(r) => {_value} wouldn't be allowed. That seems fair; why would you do it in any case? Most of the time it would be f(r) => {_value: r._value}. This seems a good trade-off to me: with this change, people would stop copying values into objects.

> For the case f(a or b), wouldn't it simply error because of a missing comma? I don't think you need increased lookahead for it, and it would error as expected:
Without semicolon insertion, it would not. It's not ambiguous. Ambiguity would be easier to solve, because we could just declare that it's a function call and move on with our lives. Without semicolons it is unambiguously two expressions; that's the only pattern that matches, and that's the problem.
👍 Now I get your point on f(a or b). When reading your comment I couldn't follow it, because if you write f(a or b) it is a function call to me; no way it could be two expressions! And I think that a user would be surprised if the parser interpreted it as such, possibly giving a weird error later.
    y = 42
    x = f(a or b)
    z = x + y

If this gets parsed as

    y = 42
    x = f
    (a or b)
    z = x + y

then we would error with something like "Cannot sum function and int", which, I think, would surprise the user!
Even more convinced of the importance of semicolon insertion.
In general, for me, the two points are orthogonal: semicolon insertion (which resolves the f(a) case) and increased lookahead. I am in favor of introducing both.
I have reached the same conclusion as Lorenzo. I agree that newlines are really useful for understanding the user's intent, so inserting semicolons where there are newlines would help the parser do that.
I also think that lookahead would still be useful for simplifying the grammar. Even in places where we can use left-factoring to keep the grammar LL(1), I think it makes the grammar harder to understand and complicates the parser.
E.g., for the blocks/objects issue:

    fn = (r) => { foo: r.foo + 1 }

A parser would have to see the ":" before it knows whether the function body is an object or a block. A left-factoring would look something like this:

    FunctionBodyExpression   := "{" BlockOrExpressionSuffix
                              | Expression
    BlockOrExpressionSuffix  := identifier BlockOrExpressionSuffix2
                              | stringLit BlockOrExpressionSuffix2
                              | BlockSuffix1
    BlockOrExpressionSuffix2 := ":" ObjectSuffix
                              | BlockSuffix2
    ObjectSuffix             := ...
    BlockSuffix1             := ...
    BlockSuffix2             := ...

This seems unwieldy to me. In contrast, just looking ahead to the ":" is relatively simple.
Given that the part we all agree on (semicolon insertion) is out-of-scope for the short term, it's unclear to me what the immediate next step is.
Left factoring is mostly automatic; it greatly simplifies the actual code at the cost of making the EBNF a bit harder to read. It is usually done to simplify the code, but we can have a discussion about that later.
I don't think there needs to be anything for the short term. The current code works, even if it isn't the most user friendly. As long as we modify the existing parser to return errors, I don't think we have to do anything.
This issue has had no recent activity and will be closed soon.
We ran into #1102, which was a grammar ambiguity. Do we have others?