tree-sitter / tree-sitter-rust

Rust grammar for tree-sitter
MIT License
340 stars 97 forks source link

Implement let_chains. #152

Closed goffrie closed 1 year ago

goffrie commented 2 years ago

Instead of a separate if_let_expression construct, we add a let_condition node that can be the condition field of an if_expression. Furthermore, we add a let_chain node that can hold any number of let_condition nodes in a chain (as well as regular expression nodes). This also applies to while let and match pattern guards.

maxbrunsfeld commented 2 years ago

Overall, I really like the way you did this, and I'd like to merge it, since I know this feature is ~stabilizing~ stable.

My only worry is that with this change, the && operator will sometimes be parsed as the operator in a binary_expression, and sometimes (when at the top level of a condition) it will not be parsed that way. Maybe that's just the reality of what && now means in Rust?

It seems possible to address this (by introducing a new binary_expression-like rule whose operands can also be let_conditions), but I'm not sure if it's worth the complexity. Just wanted to raise the issue to see what you think.

/cc @jorendorff - I'm curious about your opinion on the best way to model this.

jorendorff commented 2 years ago

Some tests to include:

if foo && bar {} - however this is parsed, a test is wise

if foo && bar || baz {} - this one is tricky; it has to parse as binary expressions

if foo && let bar = || baz && quux {} - apparently closures always extend as far as possible, regardless of whatever else is going on; this parses as if foo && let bar = (|| baz && quux) {}.

if a && let b = || c || d && e {} for good measure, maybe? Again, the closure extends all the way to the {.

jorendorff commented 2 years ago

OK, how to model this...

I like how this PR models the non-chaining case if let P = E {}. It produces (if_expression condition: (let_condition)) and that is great.

Now take if ok && let Some(v) = x && p(v) {} as an example.

  1. This PR models this using multiple condition: named children. That's an unfamiliar pattern to me; I don't think the Rust grammar can produce multiple named children with the same name anywhere else. Maybe that's normal in tree-sitter; I don't know. Here are two alternatives I like better.

  2. Treat the let condition as an expression that's an operand of &&, something like:

    (if_expression condition: (binary_expression
      left: (binary_expression
        left: (identifier) "&&"
        right: (let_condition))
        "&&"
      right: (call_expression)))

    That would be all right. This is what the Rust compiler does.

    However, for building scopes using tree-sitter it might be bad. There'd be two very different rules for scoping and-expressions, and it's not possible to query for "a binary_expression but only if any operand (even nested!) is a let_condition". So a third option:

  3. Parse it as: (if_expression condition: (let_chain (identifier) (let_condition) (call_expression))).

    The (let_chain) node would appear only when the condition contains both let and &&. In the absence of let, && would continue to parse as a (binary_expression).

    I would like this best. It's clear and easy to query and use. I just don't know if it's implementable.

jorendorff commented 2 years ago

I know this feature is ~stabilizing~ stable.

It'll be stable in 1.64 which is the next release. We've got about a month.

goffrie commented 2 years ago

if foo && bar || baz {} - this one is tricky; it has to parse as binary expressions

Good point. This is not handled correctly right now.

maxbrunsfeld commented 2 years ago

Yeah, I like the let_chain idea. I think ideally, you'd still create a binary_expression node wherever possible.

So for this code:

if a && b let Some(c) = d && e && f {
}

You'd get a tree like:

(if_expression
  condition: (let_chain
    (binary_expression ...)
    (let_condition ...)
    (binary_expression ...))
  body: (block))

It seems doable. @goffrie Are you up for exploring that direction?

goffrie commented 2 years ago
  1. Treat the let condition as an expression that's an operand of &&

So I really wanted to avoid allowing "let expressions" in arbitrary places, because they are not really allowed in Rust except in very specific places (you aren't even allowed to put parens around one). But having a separate construct for && in a condition creates ambiguity that I'm not sure tree-sitter's precedence rules are powerful enough to resolve. (Not that I have any experience with tree-sitter at all, mind you...)

So I'm tempted to just put let conditions into the expression grammar. Which as you noted, is basically how rustc's parser works anyway. (It also emits a parse-time error in some cases, such as if a || let b = c - but not all of them, e.g. if a && let b = c || d because it doesn't realize the let-expression is nested until after it's been created.)

goffrie commented 2 years ago
if a && b let Some(c) = d && e && f {
}
(if_expression
  condition: (let_chain
    (binary_expression ...)
    (let_condition ...)
    (binary_expression ...))
  body: (block))

Feasibility aside, I don't think I agree with this parse tree. It implies that e && f is a sub-expression, but that's not actually the case since && is left-associative.

Also, I have no idea how this might be implemented :)

maxbrunsfeld commented 2 years ago

Feasibility aside, I don't think I agree with this parse tree. It implies that e && f is a sub-expression, but that's not actually the case since && is left-associative.

Are you saying you wouldn't have the binary expression nodes at all, and instead a, b, e and f would all be direct children of the let_chain?

I don't like that, because it seems unnecessarily inconsistent with the parse tree you'd get if you removed the let condition let Some(c) = d.

goffrie commented 2 years ago

I think the only consistent parse tree would be one of:

(let_chain
  (let_chain
    (let_chain
      (binary_expression (identifier) (identifier))
      (let_condition ...))
    (identifier))
  (identifier))

(let_chain
  (identifier)
  (identifier)
  (let_condition ...)
  (identifier)
  (identifier))
maxbrunsfeld commented 2 years ago

I think the only consistent parse tree would be...

I don't think we should use the first parse tree, because it makes it much harder to find all of the patterns (which are important because they introduce new bindings).

I would be ok with the second parse tree. But I still think the parse tree that I suggested previously would be even better. Why not parse e && f as a binary expression? That sub-expression (e && f) is evaluated exactly like a normal binary expression, as long as the preceding conditions in the chain succeed, right?

goffrie commented 2 years ago

That's only correct insofar as && happens to be associative (in the mathematical sense) - from a syntactic point of view, a && b && c && d && e && f should be parsed like ((((a && b) && c) && d) && e) && f, no? Which doesn't include e && f as a subtree anywhere.

Putting it another way: today, in the source if a && b && c && d && e && f { ... }, the parent node of e is a && b && c && d && e. But under your proposal, if I changed the source to if a && b let Some(c) = d && e && f, e's parent would be e && f instead. Which would be pretty mysterious, e.g. if using the tree for "expand selection" in an editor.

maxbrunsfeld commented 2 years ago

Yeah, either way (your 2nd proposal or my proposal), the tree would change somewhat when adding a let_condition.

With my proposal, the hierarchy of the binary expressions would change, because you've introduced a let_condition, which binds more tightly a binary expression. With your proposal, all binary expressions would be completely removed, flattening the entire tree of binary expressions into one node.

Am i missing something? It seem like the behavior of 'expand selection' would be much more consistent if you kept the binary expressions under the let chain, as opposed to removing them all.

maxbrunsfeld commented 2 years ago

Having said that, I'm fine with doing whichever is simpler from an implementation perspective.

goffrie commented 2 years ago

I implemented my version of the proposal (flattened structure) using dynamic precedence. I'm not at all sure if this is a "good idea", but the idea is that we prefer to parse the condition as just a regular expression (which can't have a let-condition in it), or a single let_condition, but if it has nested lets inside of a && it becomes a let_chain node instead (which can have any number of children).

I think all the precedence rules are roughly right now, at least for correct code. Unfortunately, now if let a = b && c || d is accepted when it should not be. It parses as if (let a = b) && (c || d), which is wrong - the correct parse is if ((let a = b) && c) || d, which is then rejected.

goffrie commented 2 years ago

Am i missing something? It seem like the behavior of 'expand selection' would be much more consistent if you kept the binary expressions under the let chain, as opposed to removing them all.

You are correct. For this use case the "nested let_chain" parse would be the most consistent, though (as it would expand the selection to the left, instead of to the right); but I agree with you that it's less practical for other uses.

Anyhow, I actually have no idea how to implement your proposal also 😄

jorendorff commented 2 years ago

It seem like the behavior of 'expand selection' would be much more consistent if you kept the binary expressions under the let chain, as opposed to removing them all.

Oh, I disagree. This would seem super weird to me, as a user. The way you'd typically see this construct formatted is like

    if first_variable_name.is_pretty_long()
        && second_variable.nothing_to_sneeze_at_either()
        && let Some(cfg) = the_config_record_if_any
        && cfg.enable_rambutans
        && cfg.enable_arugula
    {
        ...
    }

These should be treated as siblings, imho. I don't think a peculiar nesting will end up having more intuitive behavior in practice, and I think there's some minor risk in picking an interpretation that's so different from how the Rust team thinks of the feature.

maxbrunsfeld commented 2 years ago

The way you'd typically see this construct formatted is like

That's the same way that you format nested binary expressions though, and they are structured as a left-associative tree:

    if first_variable_name.is_pretty_long()
        && second_variable.nothing_to_sneeze_at_either()
        && the_config_record_if_any.is_some()
        && cfg.enable_rambutans
        && cfg.enable_arugula
    {
       //  ...
    }

I think there's some minor risk in picking an interpretation that's so different from how the Rust team thinks of the feature.

I agree with that, but I thought it was an argument in favor of keeping the binary expressions. Doesn't rustc parse these things as binary expressions? I thought that's what you were saying in your earlier comment.

jorendorff commented 1 year ago

I agree with that, but I thought it was an argument in favor of keeping the binary expressions. Doesn't rustc parse these things as binary expressions? I thought that's what you were saying in your earlier comment.

I'm really sorry for letting this sit for a month. I didn't know it was my turn. Fortunately we got a short reprieve, since the feature was bumped back to unstable due to bugs.

Rustc uses binary trees, but I worry more about how Rust programmers think about the structure of the code, because (1) we should ideally match what's already in people's brains, and (2) if our AST is idiosyncratic, future evolution of the language is a hazard. Left-associative binary trees vs. vectors in particular seems like a total implementation detail to me; parsers tend to use these almost interchangeably, even within a parser (including rustc_parse).

Taking up the example again, here's what I think about the possible parses:

  1. Vector-like, (let_chain EXPR1 EXPR2 DECL3 EXPR4 EXPR5). This is best. We already use repeat for many things in the grammar, including block syntax, the only other place where let appears! As in blocks, each declaration would be in scope for later siblings.

  2. Left-associative binary tree, (let_chain (let_chain (let_chain (let_chain EXPR1 EXPR2) DECL3) EXPR4) EXPR5). I can live with this, but look at how the scope nesting is the opposite of the tree structure. Deeper-nested nodes are in less-nested scopes. Weird, right? (A right-associative tree would have the opposite nesting, but nothing in the grammar is actually right-associative, so I think people would rightly find that astonishing.)

  3. Integrated with expression syntax, (binary_expr && (binary_expr && (binary_expr && (binary_expr && EXPR1 EXPR2) (let_expr DECL3)) EXPR4) EXPR5). This is most like what rustc does. However, I think a departure from rustc here is consistent with existing departures elsewhere, espcially as let_expr would be invalid almost everywhere expressions are allowed. Also my earlier comment applies here:

    for building scopes using tree-sitter it might be bad. There'd be two very different rules for scoping and-expressions, and it's not possible to query for "a binary_expression but only if any operand (even nested!) is a let_condition".

  4. Mixed, maximizing the number of binary_expression nodes. (let_chain (binary && EXPR1 EXPR2) DECL3 (binary && EXPR4 EXPR5)). This is what I understand you to be proposing. To me this is the least desirable option. The AST structure it produces is more complex than a flat vector or left-associative tree, but the extra information doesn't carry any useful meaning. It suggests && has two different precedence levels.

maxbrunsfeld commented 1 year ago

here's what I think about the possible parses:

Yeah, for me, the choice is between 1 (a flat sequence of expressions and let bindings, delimited by &&) or 4 (the same, but still allowing the list to contain binary expressions with && as the operator). I have a slight preference for 4, because there would be more consistency between the following two syntax trees:

if a
    && b
    && c
    && d {
    body()
}
if let P(x) = a
    && b
    && c
    && d {
    body()
}

With option 1, the addition of one "let condition" changes the structure of the entire if condition. With option 4, it leaves more of the structure intact. So IDE features like extend selection will work more consistently with option 4 than they will with option 1. Just to clarify @jorendorff - do you agree that this is a benefit of option 4 over option 1? I'm not claiming that this one benefit trumps all other concerns, but I'm not totally clear on whether you're acknowledging this aspect of the tradeoff.

The AST structure it produces is more complex than a flat vector or left-associative tree, but the extra information doesn't carry any useful meaning. It suggests && has two different precedence levels.

I agree with you about the two different precedence levels of &&. Both option 1 and option 4 have this drawback - that && now has two different meanings. The only way to prevent this would be to choose option 3, and always parse && as the operator in a binary expression.

The argument about the "complexity" of the AST is a bit unclear to me. The way that I'd assess the complexity is by thinking from the perspective of writing code (or queries) to consume this syntax tree. From the lense, I don't think that option 4 increases the complexity of that code at all, compared to option 1. Any code that's analyzing a let_chain already needs to be able to handle the existence of binary_expression nodes within the chain. So I don't think there's any complexity increase due to allowing these binary expressions to have && as an operator.

TLDR, for me the decision between options 1 and 4 comes down to:

option 1 option 4
'extend selection' in editors :-1: less consistent :+1: more consistent
code analysis :+1: straightforward to locate pattern bindings :+1: also straightforward
&& operator :-1: two distinct meanings :-1: also two meanings
implementation status :+1: is already implemented in this PR ❓ has not been tried
jorendorff commented 1 year ago

Apologies all day for how long this is...

IDE features like extend selection will work more consistently with option 4 than they will with option 1. Just to clarify @jorendorff - do you agree that this is a benefit of option 4 over option 1?

I agree consistency of IDE behavior is as important a consideration as any other use case. I don't know anything about this IDE feature, so I'm liable to make mistakes trying to reason about it. But here goes.

The argument about the "complexity" of the AST is a bit unclear to me. The way that I'd assess the complexity is by thinking from the perspective of writing code (or queries) to consume this syntax tree. From the lense, I don't think that option 4 increases the complexity of that code at all, compared to option 1.

...into deep nerd territory...

One fairly rigorous way to understand "complexity" is in terms of information conveyed to a human when they read the output. That argument goes like this:

There's something subtle going on here where a human user has to look at a small number of examples and infer the behavior of the tool, and that is work for the human. (Maybe I am wrong that this is how normal people learn what kind of output tree-sitter produces, but personally I do this a lot!) They may end up writing exactly the same code to consume either construct, as you say, but they'll have extra bits in their brain afterwards if we choose the offbeat one. Expensive bits!

Another way is the Kolmogorov complexity of the output: how well does it compress? The flatter AST would be shorter compressed (or uncompressed, come to think of it).

I hesitated before writing "complexity" because, you know, it's pretty subjective. But I think it's correct here.

maxbrunsfeld commented 1 year ago

Ok, well thanks for taking the time to make that argument.

Overall, I do agree with your points about complexity. I still don't really agree about the 'extend selection' issue, but that's ok. The current structure is implemented well, and there are very reasonable arguments that it is the best structure.

@goffrie, I spent a little time removing the conflicts that were added here. FWIW, the way that I avoided the local ambiguities with binary_expression was to restructure let_chain as a left-associative binary tree internally. I hid the left-associative inner nodes, so that from the user's perspective, a let_chain still looks like a flat list.