rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.19k stars 12.56k forks source link

Parse error recovery is obversable by macros in several cases #103534

Open fmease opened 1 year ago

fmease commented 1 year ago

Introduction

In multiple cases, macro fragment specifiers like expr and stmt match more token streams than they should as a consequence of the parser trying to recover from obviously invalid Rust code to provide better immediate and subsequent error messages to the user.

Why Is This a Concern?

The user should be allowed to assume that a fragment specifier only matches valid Rust code, anything else would make the fragment specifier not live up to its name and as a result render it useless (to exaggerate).

One use of macros is the ability to define embedded / internal domain-specific languages (DSLs). Part of that is defining new syntax which might not necessarily be valid Rust syntax. Declarative macros allow users to create several macro rules / matchers enabling relatively fine-grained matching on tokens. Obviously, when writing those rules, macro authors need to know what a given fragment specifier accepts in order to confidently determine which specific rule applies for a given input. If the grammar used by a fragment specifier is actually larger than promised and basically unknown (implementation-defined to be precise), this becomes an impossible task.

Not only that. If we don't do anything, the grammars matched by fragment specifiers will keep changing over time as more and more recovery code gets added. This breaks Rust's backward compatibility guarantees! Macro calls that used to compile at some fixed point in time might potentially no longer compile in a future version of the compiler. In fact, backward compatibility has already been broken multiple times in the past without notice by (some) PRs introducing more error recovery.

Examples

There might be many more cases than listed below but it takes a lot of time experimenting and looking through the parser. I'll try to extend the list over time.

Expressions

macro_rules! check {
    ($e:expr) => {};

    // or `===`, `!==`, `<>`, `<=>`, `or` instead of `and`
    ($a:literal and $b:literal) => {};
    ($a:literal AND $b:literal) => {};

    (not $a:literal) => {};
    (NOT $a:literal) => {};
}

check! { 0 AND 1 } // passes (all good)
check! { 0 and 1 } // ⚠️ FAILS but should pass! "`and` is not a logical operator"

check! { NOT 1 } // passes (all good)
check! { not 1 } // ⚠️ FAILS but should pass! "unexpected `1` after identifier"
stderr ``` error: `and` is not a logical operator --> src/lib.rs:13:12 | 13 | check! { 0 and 1 } // ⚠️ FAILS but should pass! "invalid comparison operator" | ^^^ help: use `&&` to perform logical conjunction | = note: unlike in e.g., Python and PHP, `&&` and `||` are used for logical operators error: unexpected `1` after identifier --> src/lib.rs:16:14 | 16 | check! { not 1 } // ⚠️ FAILS but should pass! "unexpected `1` after identifier" | ----^ | | | help: use `!` to perform bitwise not ```

Statements

macro_rules! check {
    ($s:stmt) => {};

    // or `auto` instead of `var`
    (var $i:ident) => {};
    (VAR $i:ident) => {};
}

check! { VAR x } // passes (all good)
check! { var x } // ⚠️ FAILS but should pass! "invalid variable declaration" (+ ICE, #103529)
stderr ``` error: invalid variable declaration --> src/lib.rs:10:10 | 10 | check! { var x } // ⚠️ FAILS but should pass! "invalid variable declaration" (+ ICE, #103529) | ^^^ | help: write `let` instead of `var` to introduce a new variable | 10 | check! { let x } // ⚠️ FAILS but should pass! "invalid variable declaration" (+ ICE, #103529) | ~~~ ```

Other Fragments (e.g. Items, Types)

[no known cases at the time of this writing (active search ongoing)]

Editorial Notes

editorial notes I used to list some more cases above (which some of you might have noticed) but I've since removed them as they've turned out to be incorrect. Here they are: ```rust macro_rules! check { ($e:expr) => {}; ($a:literal++) => {}; ($a:literal@@) => {}; ($n:ident : $e:expr) => {}; } check! { 1@@ } // passes (all good) check! { 1++ } // FAILS but would fail ANYWAY! "Rust has no postfix increment operator" // ^ without the recovery code, we would try to parse a binary expression (operator `+`) and fail at the 2nd `+` check! { struct : loop {} } // passes (all good) check! { main : loop {} } // FAILS but would fail ANYWAY! "malformed loop label" // ^ without the recovery code, we would try to parse type ascription and fail at `loop {}` which isn't a type ```

Related issue: #90256 (concerning procedural macros).

@rustbot label A-macros A-diagnostics A-parser T-compiler

compiler-errors commented 1 year ago

We probably should be a lot more conservative with recoveries in macros. This reminds me of #103224.

Noratrieb commented 1 year ago

I will put up a PR to add some more info the parser to make these checks easier in the future.

compiler-errors commented 1 year ago

@nilstrieb and I discussed this a bit.

Nils will add a flag to the parser state to identify when we want recovery to be enabled, and only do recovery when this is set to true. Then all parser recoveries that exist currently should eventually be gated behind this flag, so that errors are properly bubbled up within macros, so that we call fall ahead to subsequent matcher branches.

For example, given this:

macro_rules! blah {
    ($expr:expr) => {};
    (not $expr:expr) => {};
}

fn main() {
    blah!(not 1);
}

We should not eagerly recover not as ! when parsing not 1, at least not until no other match arms correctly match the macro input.

In the case that no arms match successfully, we can retry all of the match arms with recovery enabled, which allows something like this:

macro_rules! blah {
    ($expr:expr) => {};
    (let) => {};
}

fn main() {
    blah!(not 1);
}

... to continue to give us a useful error message:

error: unexpected `1` after identifier
 --> <source>:7:15
  |
7 |     blah!(not 1);
  |           ----^
  |           |
  |           help: use `!` to perform bitwise not

error: aborting due to previous error
Noratrieb commented 1 year ago

This brings up a fun question. Should this code compile?

macro_rules! what {
    ($e:expr) => { compile_error!("no") };
    ($($tt:tt)*) => {};
}

fn wh() {
    what! {
        match 1;
    }
}

My knowledge about macros (and the two sentences the reference says about this) indicates that this should work. The first arm fails, but the second matches. But this doesn't compile, as the parser .emit()s a diagnostic on this broken code (via some cool lookahead and recovery).

So we not only have to gate all recovery directly, but also all emissions of diagnostics behind the flag. There are currently 165 references to .emit() in rustc_parse, so that's gonna be some work. But I believe that this is worth it, as the current behavior just makes no sense. Given enough awareness around compiler contributors, I think we can make sure that no new recovery is added without gating it behind the flag.

compiler-errors commented 1 year ago

I believe it should be compile, and that my changes described above and discussed in your comment ("So we not only have to gate all recovery directly, but also all emissions of diagnostics behind the flag.") would be sufficient.

Perhaps it's worthwhile to nominate this for lang team discussion, since that code you linked @nilstrieb hasn't compiled since before 1.0 :sweat_smile:

compiler-errors commented 1 year ago

Question for the lang team:

  1. Should this code compile?
    
    macro_rules! what {
    ($e:expr) => { compile_error!("no") };
    ($($tt:tt)*) => {};
    }

fn wh() { what! { match 1; } }


2. Is it generally okay to "pare back" existing recovery behavior to make sure that something like this compiles?
```rust
macro_rules! blah {
    ($expr:expr) => {};
    (not $expr:expr) => {};
}

fn main() {
    blah!(not 1);
}

--

This may even be more of a "implementation" question than a lang question, to perhaps this nomination is moot.

scottmcm commented 1 year ago

I think the core problem is that macro_rules are insufficiently future-proofed, leading to other problems too, like #86730.

My ideal solution (if we can find a way to do it without breaking the world) would be to make the second arm in

macro_rules! check {
    ($e:expr) => {};
    (not $a:literal) => {};
}

be useless, as anything it could potentially match would go into the first arm instead, and then produce an error if it's not actually a valid expr under the relevant edition rules. (TBH I don't even know if $e:expr here uses the macro definition's edition or the caller's edition when parsing.) This would probably

Otherwise for anything we try to add to the expr grammar there's an example like this that's a breaking change.

After all, if people really want special handing for something, it would be better for them to put that first, like

macro_rules! check {
    (not $a:literal) => {};
    ($e:expr) => {};
}

so that if one day we did add not false as supported syntax, it would keep working.

(Maybe that implies that for each macro matcher we'd have some "start set", like can_begin_expr? And then it's only adding something new to the matcher's start set that would be a breaking change, for which we'd need an edition change or a new matcher.)

Noratrieb commented 1 year ago

The matching logic uses the starting set already (presumably as a fast path) EDIT: Not just as a fast path, it's important semantically

fmease commented 1 year ago

My ideal solution (if we can find a way to do it without breaking the world) would be to make the second arm in […] useless

We could definitely do that for decl macros 2.0 since they are unstable. For decl macros 1.2, we could start issuing future incompat warnings.

I wonder if that'd make them less powerful.

Noratrieb commented 1 year ago

I did some more investigation on this. This thread contains quite some confusion about the actually implemented matching semantics (mostly because of myself) but I think I can now clean this up.

The currently implemented macro matching semantics

One arm is tried after another in declaration order. If matching or an arm fails with the Failure error, the next arm is tried. When no more arms are left, the macro matching failed and an error is reported. If an arm fails with the Error/ErrorReported error (the difference between the two is an implementation detail and does not matter here), the macro matching is considered failed, an error is reported and no further arms are tried.

When encountering a nonterminal specifier, a check is done whether the first token can start this nonterminal kind. If it cannot, the nonterminal is not entered (which can either lead to a Failure later as the matcher is out of matchers, or work just fine, for example with $( $_:literal)? struct matching struct). If the first token can start this nonterminal, the parser is invoked to parse the current tokens as that nonterminal.

The parser will then parse as many tokens as necessary into that nonterminal. Importantly, it may not consume all input if it deems the further grammar to be invalid. For example, NOT 1 is successfully parsed as the NOT identifier, not consuming the 1 yet.

If the parser fails to parse the nonterminal, a fatal error is emitted and ErrorReported is returned: https://github.com/rust-lang/rust/blob/0a6b941df354c59b546ec4c0d27f2b9b0cb1162c/compiler/rustc_expand/src/mbe/macro_parser.rs#L613-L621

If, after all matchers have been exhausted, there are still tokens around in the input, the matching is considered a Failure. This also applies the other way around, if there are matchers remaining but all tokens consumed.

With these semantics, this code should indeed not compile, as the expression grammar requires a { after the scrutinee expression:

macro_rules! what {
    ($e:expr) => { compile_error!("no") };
    ($($tt:tt)*) => {};
}

what! {
    match 1;
}

On the other hand, this code works just fine:

macro_rules! check {
    ($e:expr) => {};
    (NOT $a:literal) => {};
}

check! { NOT 1 }

The expression parser is invoked with NOT 1, because an identifier can start an expression. It then parses the NOT identifier as an identifier. It looks ahead to see whether the expression continues, but a literal cannot follow after an expression, so it deems its parsing to be completed and returns a successful result, leaving 1 behind in the token stream. After the expression, the matchers are over, but there are still tokens remaining in the input stream. This is a Failure error condition, so it goes to the next arm, which then matches successfully.

Where this goes wrong

macro_rules! check {
    ($e:expr) => {};
    (not $a:literal) => {};
}

check! { not 1 }

This should be equivalent to NOT example. But this is where the parser recovery comes in. After successfully parsing a not identifier, the parser eagerly looks at the next token to see whether it can start an expression. If it can, report and error and recover into !expr. In the correct Rust grammar, this can never happen in valid code, so this is fully legal for it to do normally. https://github.com/rust-lang/rust/blob/0a6b941df354c59b546ec4c0d27f2b9b0cb1162c/compiler/rustc_parse/src/parser/expr.rs#L618-L620

But this breaks the semantics described above. During nonterminal parsing, the parser must not consume more tokens than necessary, as its early return while leaving tokens behind is an important part of matching nonterminals.

What now

Based on all of this, I have two conclusions. Firstly, I believe that the current semantics make sense and should continue working like this. I would still prefer this being discussed in the lang team anyways just to make sure.

Also, it is clear that doing eager recovery by consuming more tokens than necessary should not happen in the nonterminal parser, ever. So the effort started by #103544 should continue, as this is clearly a bug.

joshtriplett commented 1 year ago

We discussed this in today's @rust-lang/lang meeting.

We'd like both of these examples to work. Making the one with compile_error work seems like it'll take more work, and may need a more detailed proposal. Making the one with not work seems like a bugfix that lang doesn't need to review and approve.

fmease commented 1 year ago

I wonder if we should take a different approach for fixing this issue instead of using .may_recover(). Bear in mind that I am not 100% familiar with the parser & macro matching internals, so the following idea might not be workable.

What about changing PResult / adding a flag to the parser to keep track of whether a recovery occured. If it did then the token stream would not actually match a given macro matcher (hmm, but that would probably mean we need to backtrack somehow, unclear). This might introduce a performance penalty.

This way, we can still show a custom error message without making the syntactically malformed input well-formed. Right now for example, m! { fn<>() } (where m expects a ty) yields the error expected ( found < instead of function pointer types may not have generic parameters (#104223) as we use .may_recover() for this case.

Just throwing my thoughts out there.

Noratrieb commented 1 year ago

In some cases this may work, but not always. Looking at the not 0 example, the parser needs to be edited. The parser looked ahead and eagerly recovers, emits an error and returns success. So even if the matcher then knew that recovery occurred (and the parser would somehow need to keep track of that, so we'd still need to annotate recovery) the session has the emitted error anyways.

But I agree that may_recover isn't great, but I haven't found anything better yet.