influxdata / flux

Flux is a lightweight scripting language for querying databases (like InfluxDB) and working with data. It's part of InfluxDB 1.7 and 2.0, but can be run independently of those.
https://influxdata.com
MIT License

Validate Flux grammar has no ambiguities #1167

Closed nathanielc closed 2 weeks ago

nathanielc commented 5 years ago

We ran into #1102, which was a grammar ambiguity. Do we have others?

affo commented 5 years ago

The approach I took is to use an ambiguity checker.

There are not many tools that do this, but I found this one: https://github.com/stepchowfun/cfg-checker.

It only accepts grammars in BNF form. Ours is in EBNF form, so I used gtools to convert it. The result is this: https://gist.github.com/affo/bde5b78bfdecbfe6901965920e1018b6.

I gave this file as input to cfg-checker, and it did not find any ambiguity in ~20 minutes of running.

About #1102, are we sure that is an ambiguity? A bug in the parser doesn't mean that the grammar is ambiguous; it probably means that the parser misunderstands something. Having an ambiguous grammar means that you can generate the same sentence with more than one distinct set of productions. Is this true here?

We said that if we implemented our parser with lookahead 3, we could fix that issue. But that would mean that changing the parser implementation fixes an ambiguity in the grammar, which is impossible: a grammar either is or is not ambiguous, no matter what parses its sentences.
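
For illustration (a textbook example, not taken from the Flux grammar), this grammar is ambiguous:

Expr := Expr "+" Expr
      | number

The sentence 1 + 2 + 3 can be derived with two distinct parse trees (grouping the left or the right + first), no matter which parser consumes it.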

nathanielc commented 5 years ago

There are at least two possible issues at play here:

  1. Ambiguities in the grammar itself; it seems that we do not have these.
  2. The grammar is at least LL(3) as currently written. Is that OK? Can we transform it to be LL(1)? Is it LL(>3)? How do we know for sure that our grammar works as we expect it to?

First steps are to create issues to answer some of these questions.

wolffcm commented 5 years ago

Article by Terence Parr (the ANTLR guy) arguing in favor of LL(k>1) and LR(k>1) grammars: https://www.antlr2.org/article/needlook.html (I have only read the abstract)

wolffcm commented 5 years ago

Here's a summary of @affo's and my findings:

Ambiguities

Our grammar does have some ambiguities stemming from the fact that we allow a syntactic shortcut for properties, that is:

// function call:
f = (a) => a * a
a = 3
f(a: a) // produces 9
f(a)    // legal; equivalent to above
// likewise for objects:
o = {a} // legal, short for {a: a}
o // produces {a: 3}

This leads to ambiguities, since for the Flux script

f(a)

there are two valid parse trees. It could either be a function call:

&ast.File{
    Body: []ast.Statement{
        &ast.ExpressionStatement{
            Expression: &ast.CallExpression{
                Callee: &ast.Identifier{Name: "f"},
                Arguments: []ast.Expression{&ast.ObjectExpression{
                    Properties: []*ast.Property{{Key: &ast.Identifier{Name: "a"}}},
                }},
            },
        },
    },
},

Or it could be two consecutive expression statements:

&ast.File{
    Body: []ast.Statement{
        &ast.ExpressionStatement{
            Expression: &ast.Identifier{Name: "f"},
        },
        &ast.ExpressionStatement{
            Expression: &ast.Identifier{Name: "a"},
        },
    },
},

Do we want to continue to support the short form for properties? There is a case for the answer being no.

Note however that there could be other ambiguities in the grammar that we don't yet know about. cfg-checker did not find the one described above.

Parsing Strategy

Historically, LL(k>1) grammars were considered "bad" because automatically generated parsers required space on the order of O(|T|^k), where |T| is the number of token types and k is the amount of lookahead. This paper discusses this and also provides some helpful background on parsing strategies in general:

https://www.antlr2.org/article/needlook.html

My takeaway from this article is that since our parser is hand-coded, we can avoid the O(|T|^k) space penalty. That is, we can restrict lookahead to the minimum amount needed in any given context. This is effectively an LL(*) parsing strategy, which is what ANTLR uses:

https://www.antlr.org/papers/LL-star-PLDI11.pdf
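
As a minimal sketch (with hypothetical names; this is not the actual flux parser API), a hand-coded parser can buffer only as many tokens as the current decision needs:

package parser

// token and scanner are hypothetical stand-ins for the real flux types.
type token struct {
	typ string // e.g. "IDENT", "LPAREN", "OR"
	lit string
}

type scanner interface {
	scan() token
}

// parser buffers only the tokens that the current decision actually peeks
// at, so the O(|T|^k) tables of a generated LL(k) parser never materialize.
type parser struct {
	s   scanner
	buf []token // tokens scanned but not yet consumed
}

// peek returns the i-th upcoming token (0-based) without consuming it,
// scanning lazily to fill the buffer.
func (p *parser) peek(i int) token {
	for len(p.buf) <= i {
		p.buf = append(p.buf, p.s.scan())
	}
	return p.buf[i]
}

// next consumes and returns the next token.
func (p *parser) next() token {
	t := p.peek(0)
	p.buf = p.buf[1:]
	return t
}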

Proposed Actions

Remove support for short-form properties

Update our grammar and parser to prohibit the short form of properties. Note: breaking change.

Update our parser to employ LL(*)

Update our parser to look ahead up to three tokens in order to differentiate calls from other expressions. This would enable our parser to recognize scripts like this (from #1102):

not (a and b)
(a or b) and c

The parser could look ahead three tokens after the first line to see ( a or. Once it sees the or, it can deduce that the ( does not start a function call and begin parsing a new expression statement instead.
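
With a peek helper like the one sketched earlier, that decision might look something like this (hypothetical, not the actual implementation):

// looksLikeCallArgs reports whether a ( following an expression opens a
// property list (i.e. a call) rather than a new parenthesized expression.
// It peeks at most three tokens: ( then IDENT then one of : , ).
func (p *parser) looksLikeCallArgs() bool {
	if p.peek(0).typ != "LPAREN" {
		return false
	}
	if p.peek(1).typ == "RPAREN" { // f() is a call with no arguments
		return true
	}
	if p.peek(1).typ != "IDENT" {
		return false
	}
	switch p.peek(2).typ {
	case "COLON", "COMMA", "RPAREN": // f(a: ...), f(a, ...), f(a)
		return true
	}
	return false // e.g. ( a or ... starts a new expression statement
}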

Moreover, allowing for a variable amount (up to three tokens) of lookahead in our parser would also allow us to stop requiring users to parenthesize the body of lambdas in calls to map:

map(fn: (r) => {_time: r._time, _value: r._value + 1})

Our parser cannot currently handle the above, since it assumes that { begins a statement block. If it could look ahead as in the example above, the more intuitive way to express a lambda whose body is an object literal would work.
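
A similar peek could distinguish the two cases (again only a sketch; string-literal keys and the {a} shorthand would need extra cases):

// functionBodyIsObject reports whether a { after => opens an object
// literal rather than a statement block, by peeking for { IDENT :.
func (p *parser) functionBodyIsObject() bool {
	return p.peek(0).typ == "LBRACE" &&
		p.peek(1).typ == "IDENT" &&
		p.peek(2).typ == "COLON"
}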

Caveats

If we ever want to support positional parameters in addition to named parameters, moving forward with this strategy could further entrench the status quo.

While working on #1102, I implemented most of the required parser updates on a branch, but set it aside.
I hit some difficulty with how we scan for regex literals: the scanner provides both a Scan and a ScanWithRegex method, and it allows callers to put back just one token in case the parser wants to peek for a regex literal or not.

It's unclear to me whether this mechanism could be extended to three tokens, and even if it could be, the resulting code might be difficult to understand and maintain.

wolffcm commented 5 years ago

For reference here's the Flux grammar in the format that ANTLR uses: https://gist.github.com/wolffcm/61498cd81e1140277bc0b4e9263766fd

affo commented 5 years ago

@wolffcm crystal clear, it looks great! Thank you for the detailed summary 😄 👍

affo commented 5 years ago

A final side note is that adding semi-colons ; to our grammar would disambiguate

f(a);

from

f;(a);

However, that wouldn't help with

f = (r) => {v: r._value}

That would always require 3 tokens of lookahead to find the : and understand that it's not a BlockStatement but an ObjectExpression.

nathanielc commented 5 years ago

@wolffcm Would having 3 token lookahead remove the need to have Scan vs ScanWithRegex?

My thoughts:

  1. We want the shortcut syntax for calling functions; @pauldix has repeatedly asked for this, and our users consistently tell us they want this kind of less verbose syntax.
  2. We are not going to ever have positional arguments. If we do, it will be a whole new version of Flux. Keyword arguments are a foundational design choice, so I am not worried about further ingraining them.
  3. It seems that three-token lookahead will already simplify the grammar in a few places, so this feels like an overall good improvement in usability.
  4. If it turns out we can eliminate ScanWithRegex and only have a single Scan method on the scanner, that is another win for the maintainability of the code.

On ScanWithRegex vs Scan my understanding is this:

We need the different scan methods because when we see a / token we don't know whether it is a division operator or the start of a regex. Right now we treat a regex literal / ... / as a single token. Would it be possible to treat / as a standalone token and use lookahead to disambiguate / as a binary operator from / as an opening regex delimiter? I see two ways this could work:

  1. Something about binary expressions makes them invalid regex patterns, so with a short lookahead we can distinguish them. I doubt this is possible :(
  2. We already know the context in which / means division vs. regex, since we call the Scan vs ScanWithRegex methods accordingly, so maybe if we see / in a regex context we just keep scanning until we see another /. Not sure if this breaks the scanner or not.
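
As a sketch of option 2 (a hypothetical standalone function, not the flux scanner's API): having already consumed a / in a context where a regex is allowed, keep consuming runes until the next unescaped /:

// scanRegexAfterSlash assumes the opening / was just consumed at pos-1.
// It returns the pattern body and the position just past the closing /.
func scanRegexAfterSlash(src []rune, pos int) (pattern string, end int) {
	start := pos
	for pos < len(src) {
		switch src[pos] {
		case '\\': // skip the escaped rune, e.g. \/
			pos += 2
			if pos > len(src) {
				pos = len(src)
			}
		case '/':
			return string(src[start:pos]), pos + 1
		default:
			pos++
		}
	}
	// Unterminated regex; the caller should report an error.
	return string(src[start:pos]), pos
}
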
wolffcm commented 5 years ago

We want the shortcut syntax for calling functions; @pauldix has repeatedly asked for this, and our users consistently tell us they want this kind of less verbose syntax.

That's fine with me. In that case, does it make sense to have similar syntax for object literals too? Just thinking about how to resolve these ambiguities...

If it turns out we can eliminate ScanWithRegex and only have a single Scan method on the scanner, that is another win for the maintainability of the code.

It's unfortunate that we have both ScanWithRegex and Scan, but I think having these two methods is the lesser of two evils. Making the parser start with token.DIV and scan tokens until it finds the end of a regex is really the work of a scanner, and it's not a simple task.

Consider that f(/ "foo /, / "foo /) (function call with two regex literal arguments) would confuse the scanner absent a call to ScanWithRegex, since it would attempt to tokenize it like this: f ( / "foo /, / " foo / ).

I think there's a solution somehow, still not sure what it is.

nathanielc commented 5 years ago

@wolffcm The examples look good.

Function calls with literals are not allowed, only with identifiers.

That said, your example is still valid with a small edit:

f(a: / "foo /,  b:/ "foo /)

Tokens: f ( / "foo /, b:/ " foo / ).

OK, I agree that two scanner functions is the best way forward. Is there a case where we need lookahead across the two functions? Meaning, we may need to look ahead 3 on Scan and look ahead 3 on ScanWithRegex, but do we ever have to mix lookaheads across the two scan methods? Do we even need lookahead with the ScanWithRegex method? Do we have any ambiguities around regex literals?

wolffcm commented 5 years ago

The Scan vs ScanWithRegex question is easily solved, since we never need to scan past a / to disambiguate a property list from something else.

I've got a PR that implements this parsing strategy here: https://github.com/influxdata/flux/pull/1280

jsternberg commented 5 years ago

I think we need to represent these in the grammar or we're going to end up with more bugs. Whether or not we have lookahead is tangential to that issue.

We should interpret the function body of fn = (r) => {_value} as returning {_value: _value}, rather than a statement block with a single expression statement.

I agree that this makes sense, but how do we do it? It seems to me like this isn't possible to express in a grammar. I do not think we can have a grammar with a bunch of hand-coded disambiguations; otherwise there will be more of these.

I'd like if we talked about the various grammar ambiguities and fixed the EBNF before making any changes to the parser. I don't think it's possible to have a conversation about lookahead until the EBNF grammar matches the proposed changes.

jsternberg commented 5 years ago

I've been thinking about this more, and I do think semicolons, along with rules about when they are inserted, are likely the correct solution. The reason is mostly usability, and because I think increasing the lookahead of the parser won't solve the underlying ambiguities.

An underlying problem is that the current parser needs lookahead to correctly handle the following sequence of tokens:

f(a or b)

According to the grammar, this should be two expressions because the pattern does not match a call expression, but the parser currently sees the ( and assumes that it is a call expression. The correct behavior according to the EBNF can only be determined when you get to or. I think we can all agree that this is the case and there's no ambiguity for the above.

And I think that's a problem. If I were a user and that got parsed as two expressions, I would be confused and the parser error message would probably be nonsensical. In fact, I would guess the parser wouldn't return an error.

Additionally, I think this causes a problem:

a = from(bucket: "telegraf") |> range(start: -5m)
f
(a) |> yield()

In this situation, it would probably surprise the user to learn that (a) is being used to call the function. Maybe the call to f was unintentional, or there's something more complicated on the previous line that gets parsed as an expression. I think f() |> yield() on the second line would cause the same parser irregularity as just f.

I think there's an expectation that a newline means something even if we specify in the grammar it means nothing. This makes me think that creating rules for when newlines matter, which is what automatic semicolon insertion is, would clear up these ambiguities and give a better user experience.
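
As a sketch, a Go-style insertion rule (Flux would need to choose its own token set) inserts a semicolon at a newline only when the previous token could legally end a statement:

// canEndStatement reports whether a semicolon may be inserted when a
// newline follows tok. The token names here are illustrative.
func canEndStatement(tok string) bool {
	switch tok {
	case "IDENT", "INT", "FLOAT", "STRING", "REGEX", "RPAREN", "RBRACK", "RBRACE":
		return true
	}
	return false
}

Under such a rule, the newline after the lone f above would insert a semicolon, so the following (a) |> yield() would start a new statement instead of calling f.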

affo commented 5 years ago

I do like the semicolon insertion strategy, and I think you are right, newlines create expectations.

Adding semicolons would eliminate the ambiguity

f(a)

VS

f
(a)

And that's cool.

For the case f(a or b), wouldn't it simply error because of a missing comma? I don't think you need increased lookahead for it, and it would error as expected:

> f(a or b)
Error: unsupported key <nil>

Increased lookahead is required for:

f = (r) => {return x}

VS

f = (r) => {x: r.x}

to discriminate a block VS an object (it is weird to ask the user to add extra parentheses).

The case:

f = (r) => {_value}

is a matter of choosing whether or not to error. The ambiguities are caused by a single rule we decided to include to reduce verbosity: {a} === {a: a}. This impacts both objects and arguments to functions (as they are objects), and it forces us to decide on all of its implications. I don't think it would lead us to introduce more ambiguities and special cases in the future.

It seems to me that {a} === {a: a} was introduced just to make arguments in a call behave like positional arguments when using the right identifiers. I don't know how useful it is for specifying objects, or how much it is used. So, why don't we separate the arguments of a call from objects in the grammar, so that the rule applies only to function calls?

As a consequence, f = (r) => {_value} wouldn't be allowed. That seems fair; why would you write it anyway? Most of the time it would be f = (r) => {_value: r._value}. This seems like a good trade-off to me.

jsternberg commented 5 years ago

For the case f(a or b), wouldn't it simply error because of a missing comma? I don't think you need increased lookahead for it, and it would error as expected:

Without semicolon insertion, it would not. It's not ambiguous. Ambiguity would be easier to solve, because we could just declare that it's a function call and move on with our lives. Without semicolons, it's unambiguously two expressions; that's the only pattern that matches, and that's the problem.

affo commented 5 years ago

👍 Now I get your point on f(a or b). When reading your comment I couldn't see it at first, because if you write f(a or b) it is a function call to me; no way it could be two expressions! And I think a user would be surprised if the parser interpreted it as such, possibly giving a weird error later.

y = 42
x = f(a or b)
z = x + y

If this gets parsed as

y = 42
x = f
(a or b)
z = x + y

Then we would error with something like Cannot sum function and int, which, I think, would surprise the user!

Even more convinced of the importance of semicolon insertion.

affo commented 5 years ago

In general, for me, the two points (semicolon insertion and increased lookahead) are orthogonal, and I am in favor of introducing both.

wolffcm commented 5 years ago

I have reached the same conclusion as Lorenzo. I agree that newlines are really useful for understanding the user's intent, so inserting semicolons at newlines would help the parser capture that intent.

I also think that lookahead would still be useful for simplifying the grammar. Even in places where we can use left factoring to keep the grammar LL(1), I think it makes the grammar harder to understand and complicates the parser.

E.g., for the blocks/objects issue:

fn = (r) => { foo: r.foo + 1 }

A parser would have to see the : before it knows whether the function body is an object or a block. A left factoring would look something like this:

FunctionBodyExpression   := "{" BlockOrExpressionSuffix
                         | Expression
BlockOrExpressionSuffix  := identifier BlockOrExpressionSuffix2
                         | stringLit BlockOrExpressionSuffix2
                         | BlockSuffix1
BlockOrExpressionSuffix2 := ":" ObjectSuffix
                         | BlockSuffix2
ObjectSuffix := ...
BlockSuffix1 := ...
BlockSuffix2 := ...

This seems unwieldy to me. In contrast, just looking ahead to the : is relatively simple.

wolffcm commented 5 years ago

Given that the part we all agree on (semicolon insertion) is out-of-scope for the short term, it's unclear to me what the immediate next step is.

jsternberg commented 5 years ago

Left factoring is mostly automatic and is usually done to simplify the actual code, at the cost of making the EBNF a bit harder to read; we can have a discussion about that later.

I don't think anything needs to happen in the short term. The current code works, even if it isn't the most user-friendly. As long as we modify the existing parser to return errors, I don't think we have to do anything else.

github-actions[bot] commented 3 weeks ago

This issue has had no recent activity and will be closed soon.