rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.61k stars 12.74k forks source link

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790

Open fmease opened 3 months ago

fmease commented 3 months ago

Inspired by https://github.com/rust-lang/rust/issues/128717#issuecomment-2270315205. CC @jieyouxu.


Since we recover from lexically invalid tokens that are Unicode-confusable with tokens that are lexically valid (e.g., U+037E Greek Question MarkU+003B Semicolon; U+066B Arabic Decimal SeparatorU+002C Comma), (suggestion) diagnostic code down the line generally ought not make too many assumptions about the length and/or position in bytes that the Span of a supposed Rust token/lexeme "maps to" .

In reality however, all too often (suggestion) diagnostic code doesn't follow this 'rule' when performing low-level "span manipulations" defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like - BytePos(1) or + BytePos(1) where the code guesses that the code point before/after corresponds to a certain token/lexeme like ,, ;, ). However, such code doesn't account for the aforesaid recovery which may have mapped a UTF-8 code unit with byte length > 1 to an ASCII character of length 1 which can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).

sigh we have too many hard coded +1/-1 in the compiler -- @compiler-errors


So it might be worth linting against these error-prone BytePos & Span manipulations. I don't know how feasible it'd be to implement such a lint well (i.e., low false positive rate) or how the exact rules should look like.

This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.


Uplifted from https://github.com/rust-lang/rust/issues/128717#issuecomment-2273735687:

It's so easy to find these kinds of ICEs:

  1. One simply needs to look for /(\+|-) BytePos\(\d+\)/ inside compiler/,
  2. Figure out which ASCII character is meant
  3. Open one's favorite Unicode table website or program that can list Unicode-confusables
  4. Pick a confusable whose .len_utf8() is >1 and :boom:

Example ICE: #128717.

Example ICE: I just found this a minute ago while reviewing an unrelated PR:

This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.

fn f() {}

fn main() {
    f(0,1❩;
}

Leads to:

thread 'rustc' panicked at compiler/rustc_span/src/lib.rs:2119:17:
assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32

Code in question:

https://github.com/rust-lang/rust/blob/c9687a95a602091777e28703aa5abf20f1ce1797/compiler/rustc_hir_typeck/src/fn_ctxt/checks.rs#L1144

Discussions

Related issues

jieyouxu commented 2 months ago

As for suggestions:

compiler-errors commented 2 months ago

Alternatively, if suggestions are often finding that a span needs to be recreated to match some subset of the AST (e.g. a keyword's span like async needs to be tracked for suggestions downstream), consider modifying the AST directly to track that span. But please only do this if it's generally useful, and not just needed for a single suggestion, and also be sure to audit the code for existing suggestions to simplify.