rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.87k stars 12.67k forks source link

"-NUM.method()" is parsed in surprising ways #117155

Closed RalfJung closed 2 months ago

RalfJung commented 12 months ago

This came up here but I am pretty sure I ran into this before: Rust parses -5.5f32.method() as -(5.5f32.method()). I don't know if that is the parsing result that anyone expects, I certainly would never expect it -5.5f32 is a float literals, and that is what I want to apply the method to, I never even considered that the compiler might interpret this as a unary minus applied to the method result.

Assuming that we cannot change the parser, I think we should spend some effort defending against the possible confusion here. For instance:

jhorstmann commented 12 months ago

This also made me question my sanity some time ago, a lint would have been very helpful. The code in question was

assert_eq!(1_i32, -1_i32.rem_euclid(12)) // should be (-i_i32)
clarfonthey commented 12 months ago

I would personally be very surprised if -x.method() and -1.method() parsed differently, and it feels proper to me that -(x.method()) would be the way it's parsed.

This matches similar precedence rules in maths, like how -2x is -(2x).

Another good test is if adding 0 to the beginning would change the result: 0-x.method() shouldn't parse as (0-x).method(), so why should -x.method() be different?

RalfJung commented 12 months ago

In math I would insist on parentheses since -2^x looks ambiguous.

Another good test is if adding 0 to the beginning would change the result:

That's not at all how think about negative integer literals. -2 is just the mathematical integer negative-two, and having -2.method() split up the literal is totally unexpected.

It's great that your intuition aligns with the current parser, but that doesn't mean that everybody's intuition aligns with the parser. I literally thought the gamma_ln function had a bug (see the comment linked in the OP); it didn't even occur to me that the Rust parser could dis-associate the - from my integer literal -- that possibility was so outlandish it didn't enter my mind. So clearly this is ambiguous, and the right thing to do is add parentheses.

chfogelman commented 12 months ago

I too would be very surprised if it was not parsed the way it currently is. If -5.method() were parsed as (-5).method(), then how would we expect - 5.method() to be parsed? You either end up having very unintuitive behavior, where - 5.method() is parsed as (-5).method(), or you end up in a situation where whitespace affects the way an expression is parsed.

To my knowledge, every language with both unary minus and a . operator does exactly what Rust currently does here.

I also don't think it's fair to say that the compiler is disassociating the - from the literal. The lexer has always treated - as a separate token from whatever follows it.

RalfJung commented 12 months ago

It doesn't matter what the lexer does internally. For many humans reading and writing code, "-5" is an atom that cannot be split meaningfully. If the lexer doesn't work like that then it is up to us to minimize the fallout and confusion stemming from that decision. This is true even if the alternative parsing would be just as surprising in other contexts. Thay just means we should steer the user towards code that is unambiguous.

chfogelman commented 12 months ago

Even ignoring what the lexer does, my point is that for many other humans, especially those who work in mathematics or computing, -5 is a sequence of two atoms that obey standard operator precedence and just happen not to have a space between them. For those of us in that group, anything other than the current behavior would be confusing. I don't necessarily disagree that it can create confusion, though I do think that changing the parser would have been a step too far.

As a side note, rustfmt currently reformats let _ = - 5.method() to let _ = -5.method(). Should this be another angle by which we can reduce confusion?

Ghabriel commented 12 months ago

The way it's currently being parsed feels very natural and correct to me as well. It not only matches similar rules in math as pointed out earlier but is also how it works in every language that allows an equivalent syntax. Adding extra parentheses in situations like this would just feel like visual noise to me, personally.

clarfonthey commented 12 months ago

Note that there are cases where a negative literal is treated as a fused literal, in particular in cases with the literal macro matcher. In these cases, it was weird that 1.0 would be accepted but not -1.0, which makes sense from a syntax perspective but not from a user perspective.

I don't really think that any of the proposed suggestions for negation really work well, though. Adding an extra space looks like a typo, and adding extra parentheses also makes rewriting code more obnoxious since it's a non-local change (adding a negative sign requires adding a closing parenthesis at the end of the expression).

Even though the precedence rules are well-motivated, parentheses exist regardless and Rust even has lints that encourage using them even in unambiguous situations, like converting x && y || z to (x && y) || z.

So, I do think that at minimum there could be an allow-by-default lint suggesting one or more of these methods, and/or style rules for rustfmt.

RalfJung commented 12 months ago

Even ignoring what the lexer does, my point is that for many other humans, especially those who work in mathematics or computing, -5 is a sequence of two atoms that obey standard operator precedence and just happen not to have a space between them.

As a data point, I work in computing. Programming languages are literally my job.

I guess we'd need a study to see how many "many" is. :shrug:

though I do think that changing the parser would have been a step too far.

It's not realistic to change the parser so we agree on the outcome here, if not on the reasons.

It not only matches similar rules in math

I'm not aware of math having well-defined operator precedence for cases like this. Do you have a source for this claim?

Even though the precedence rules are well-motivated, parentheses exist regardless and Rust even has lints that encourage using them even in unambiguous situations, like converting x && y || z to (x && y) || z.

Yeah, that's why I was thinking a lint would make sense here. FWIW those cases are ambiguous if you can't remember whether "and" or "or" binds stronger. And in languages like bash, they actually bind equally strong (and are left-associative), so this is inconsistent across languages.

I don't think it is realistic for everyone to remember all the precedence rules, so I agree with the compiler that the parentheses should be added here to avoid ambiguity.

jdahlstrom commented 12 months ago

for many other humans, especially those who work in mathematics or computing, -5 is a sequence of two atoms

I doubt that many mathematicians would consider "-5" anything but a single unit, the negative integer that is the additive inverse of 5, just like they wouldn't consider "½" a division but the rational number that's the multiplicative inverse of 2.

chfogelman commented 12 months ago

Note that there are cases where a negative literal is treated as a fused literal, in particular in cases with the literal macro matcher.

Interestingly, $lit:literal will even match unfused literals like - 1. The parser follows standard operator precedence, and only then raises negated literal expressions to be literals themselves.

As someone who can't imagine this expression being parsed any other way, I'm curious how everyone else views whitespace. My assumption has always been that I can add arbitrary whitespace around any operator and get the same results. So if you find the way this expression is parsed confusing, how would you mentally parse the following expressions?

-1.method()
- 1.method()
-x.method()
- x.method()

clarfonthey commented 12 months ago

It not only matches similar rules in math

I'm not aware of math having well-defined operator precedence for cases like this. Do you have a source for this claim?

Even though this wasn't directed at me, I figured I'd say that my primary argument is based upon the exponent rules (which always bind exponents before unary minus in mathematics), and the fact that function calls are generally put at the highest precedence above even that. Note that these rules are variable, for example, some programs like Excel will bind the unary minus before exponents, but I consider this unnatural since it goes against standard mathematics. In primary school, we were explicitly taught that expressions like -2x were a potentially confusing case under these rules, although if you consider unary minus the same as any other arithmetic operator, it makes sense that it would have lower precedence.

Function calls having the highest precedence has also been the standard since C, and I would imagine that methods would have similarly high precedence. Since unary minus doesn't have to apply to just numeric literals, I would always imagine it to be parsed separately, since as shown, the precedence of -1.method() even mattering would depend on - being lexed as a separate operator.

not-an-aardvark commented 12 months ago

From the peanut gallery: there's an additional complication for minimum signed integer values:

chfogelman commented 12 months ago

This is not a result of applying the unary minus operator to the literal 128i8, which would be ill-formed because 128i8 is out of range. Intuitively, this suggests that -128i8 is interpreted as a single literal.

The parser parses everything according to standard precedence rules, treating - as unary minus. So this gets parsed as something like Unary(Neg, Literal("128i8")). The compiler then treats this entire unit as a literal for purposes of overflow checks and such. It's not that -128_i8 is interpreted as a single literal; it's that Unary(Neg, Literal(_)) is treated as a sort of "honorary literal" after parsing. This is why let x = - 128i8; doesn't result in an error either.

However, a method call like -128i8.pow(1) is parsed as -((128i8).pow(1)), resulting in an out-of-range error.

How does your brain parse - 128i8.pow(1)?

It seems bizarre for a literal to become out of range as a result of adding a method call.

You've added another operator (.) with a higher associativity, which changes what the - is being applied to. As an analogy, 2 + 2 is 4, but 2 + 2 * 3 is not 12. Adding the * 3 changes what 2 + is being applied to.

clarfonthey commented 12 months ago

However, a method call like -128i8.pow(1) is parsed as -((128i8).pow(1)), resulting in an out-of-range error.

While this makes sense, it's certainly surprising and I never thought of this case. I definitely agree that situations like this should have a warn-by-default lint at minimum, and maybe there can be a specific list of methods where this particular parsing rule has "weird" results, like this one.

chfogelman commented 12 months ago

maybe there can be a specific list of methods where this particular parsing rule has "weird" results, like this one.

To clarify, the method being called doesn't matter here. It's a compile-time error because the compiler can't create an 128_i8 to call the method on. -128_i8.to_string() results in the same error.

RalfJung commented 12 months ago

My assumption has always been that I can add arbitrary whitespace around any operator and get the same results. So if you find the way this expression is parsed confusing, how would you mentally parse the following expressions?

My intuition parses -5.method() different from - 5.method(). This makes sense since -5 is the atom "integer: negative five", while - 5 is unary minus applied to 5. Those two terms evaluate to the same result but they are not the same term. (Yes I do realize that's not how the Rust parser works.)

I'd say that my primary argument is based upon the exponent rules (which always bind exponents before unary minus in mathematics),

Yeah you mentioned that above. I wasn't aware of this rule, but Wikipedia confirms. Wikipedia also lists some exceptions where the rule does not apply, though.

Anyway, while we can go on discussing which version is "more intuitive", that's not quite the point of this issue. The point of this issue is that the current parsing is ambiguous and surprising to some people. There's two outcomes here:

  1. this is ambiguous for so few people that it's not worth doing anything about
  2. this is ambiguous for sufficiently many people that we should do something about it (a lint and/or rustfmt rule)

I prefer outcome 2, but I don't have the data to prove that it's ambiguous for "sufficiently many" people. In this issue we seem to have a fairly even split, but of course this is far from representative.

Function calls having the highest precedence has also been the standard since C,

C doesn't have method calls so the issue can't even arise there? But I take it this means Java and C++ parse this like Rust. That's a good argument to keep the parser as-is, but it's not an argument against a lint IMO.

clarfonthey commented 12 months ago

C actually does have method calls since you can put function pointers in structs, although since function application has a higher precedence than field access, you need to add parentheses anyway. Although -(x.y)() would still be parsed as -((x.y)()).

I should clarify my position here: I'm fully in support of adding rules to both rustfmt and rust, it's mostly whether they'd be enabled by default that is my particular concern.

RalfJung commented 12 months ago

I should clarify my position here: I'm fully in support of adding rules to both rustfmt and rust, it's mostly whether they'd be enabled by default that is my particular concern.

Thanks for clarifying. I do think the lint should be warn-by-default, since otherwise it won't actually help people that run into this without realizing the possibility of running into this (like my past self).

traviscross commented 12 months ago

How languages handle exponentiation is a close analogy. If a language were to treat -2 as a single atom, then -2**2 would equal 4.

However, in Fortran (1957-), Perl, Python, and Ruby, -2**2 == -4.

In Haskell, -2^2 == -4 (and -2**2 == -4.0).

And in Rust, assert_eq!(-2i8.pow(2), -4).

Linting may seem somewhat opinionated given the precedent and history.

We should also think about symmetry with the other unary operator, !. Should these lint?

#[test]
fn test() {
    assert_eq!(127, !2u8.pow(7));
    assert_eq!(false, !2u8.pow(7).is_power_of_two());
}

(I would not expect linting here. But I would expect consistency with - in any case.)

traviscross commented 12 months ago

Mathematically, adding a negative sign is taught as being exactly equivalent to a multiplication by negative one. That's at least one justification for why $-2^2 = -4$ in common notation.

RalfJung commented 12 months ago

Linting may seem somewhat opinionated given the precedent and history.

And how may people programming in these languages know this rule? In a Python code review I would definitely insist on adding parentheses to -2**2.

Should these lint?

No. These are clearly not literals.

But I would expect consistency with - in any case.

! and - are very different though. - exists as part of the syntax for literals like -5, unlike !. Even the rustc parser recognizes this difference in its treatment for -128u8.

traviscross commented 12 months ago

As @chfogelman described, the rustc parser treats -128i8 in an ordinary manner, but later stages of the compiler then treat it specially for overflow checks. Note that -(128i8) also works.

chfogelman commented 12 months ago

I definitely support warn-by-default. Parsing is invisible to the user, so if you don't know that operator precedence is the issue, it'll take a very long time to figure out.

I don't think ! should be linted on. Unlike -, prefix ! doesn't have baggage from mathematics or everyday life.

Note that -(128i8) also works.

Yep, and so does (((((-(((((128i8)))))))))). The AST doesn't carry any information about parentheses.

RalfJung commented 12 months ago

As @chfogelman described, the rustc parser treats -128i8 in an ordinary manner, but later stages of the compiler then treat it specially for overflow checks. Note that -(128i8) also works.

It's not just overflow checks. It's part of turning the tokens into an AST. -128 becomes a constant with value -128, !128 becomes the unary operator ! applied to the constant 128. So there's a fundamental difference between - and ! baked pretty deep into the rust parser (or a later stage during AST/HIR/MIR construction).

traviscross commented 11 months ago

As @chfogelman described, the rustc parser treats -128i8 in an ordinary manner, but later stages of the compiler then treat it specially for overflow checks. Note that -(128i8) also works.

It's not just overflow checks. It's part of turning the tokens into an AST. -128 becomes a constant with value -128, !128 becomes the unary operator ! applied to the constant 128. So there's a fundamental difference between - and ! baked pretty deep into the rust parser (or a later stage during AST/HIR/MIR construction).

At the level of AST generation, the literal is always 128i8, regardless of whether the input is -128i8 or !128i8. The AST for these is exactly the same except that Neg is replaced by Not.

Proof We can see this with `rustc +nightly -Z unpretty=ast-tree`. ```rust fn main() { _ = -128i8; } ``` Becomes: ```rust Expr { kind: Unary( Neg, Expr { kind: Lit( Lit { kind: Integer, symbol: "128", suffix: Some( "i8", ), }, ), attrs: [], tokens: None, }, ), attrs: [], tokens: None, }, ``` And this: ```rust fn main() { _ = !128i8; } ``` Becomes: ```rust Expr { kind: Unary( Not, Expr { kind: Lit( Lit { kind: Integer, symbol: "128", suffix: Some( "i8", ), }, ), attrs: [], tokens: None, }, ), }, ), attrs: [], attrs: [], tokens: None, }, ```

Here's one place where there is special handling in the compiler: compiler/rustc_lint/src/types.rs:414

fn lint_int_literal<'tcx>(
...
) {
...
    if (negative && v > max + 1) || (!negative && v > max) {
    ...
    }
}

Looking at:

fn main() {
    dbg!(!127i8);
    dbg!(-128i8);
}

In the MIR, these both end up as const i8::MIN in release mode. In the HIR, these are both still just !127i8 or -128i8.

chfogelman commented 11 months ago

@traviscross -128i8 starts being treated as a fused literal during MIR lowering:

https://github.com/rust-lang/rust/blob/2f1bd0729b74787f55d4cbc7818cfd787cd43a99/compiler/rustc_mir_build/src/thir/cx/expr.rs#L503-L512

Basically, nothing in the entire compiler treats !127 as a literal, because it doesn't have to. - needs special treatment because it's used to form i*::MIN literals, but ! never forms part of a literal, not in the AST, not in the HIR, not in everyday life.

I can get behind linting -1.method() because of this special treatment of -, but since ! doesn't get any special treatment, I don't see the point in linting on !1.method().

traviscross commented 11 months ago

@chfogelman: Interesting. Thanks for pointing to where that happens during MIR lowering.

Agreed ! doesn't get any special treatment because mathematically it doesn't need it. The question is whether that special treatment, motivated by the desire to make one number per type work in a convenient way, is enough to justify different treatment for the two otherwise similar unary operators. It's an interesting question to consider.

clarfonthey commented 11 months ago

I think that no matter what here, the solution shouldn't be motivated by what the parser actually does here, and instead by what people expect the parser should do. The reason why I personally would rather not format -1.method() differently is because I expect it to work a certain way and what it does matches my expectations.

However, my expectations are not everyone else's expectations, and that's reason to look into this. Explaining what the parser does here ultimately doesn't make a difference because we're not talking about the current implementation, but how we expect the parser to act regardless of implementation.

ankane commented 11 months ago

fwiw, Ruby parses it the way that's expected in the original issue (different from Rust):

-2.abs
#=> 4

x = 2
-x.abs
#=> -4