Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations

fmease commented 3 months ago

Inspired by https://github.com/rust-lang/rust/issues/128717#issuecomment-2270315205. CC @jieyouxu.

Since we recover from lexically invalid tokens that are Unicode-confusable with lexically valid ones (e.g., U+037E Greek Question Mark → U+003B Semicolon; U+066B Arabic Decimal Separator → U+002C Comma), (suggestion) diagnostic code down the line generally ought not to assume the byte length or byte position of the `Span` that a supposed Rust token/lexeme "maps to".

In reality, however, (suggestion) diagnostic code all too often breaks this 'rule' when performing low-level "span manipulations", defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like `- BytePos(1)` or `+ BytePos(1)` where the code guesses that the code point before/after corresponds to a certain token/lexeme like `,`, `;` or `)`. However, such code doesn't account for the aforesaid recovery, which may have mapped a code point whose UTF-8 encoding is longer than 1 byte to an ASCII character of length 1. This can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).
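
The failure mode can be reproduced with the standard library alone. The following is a minimal sketch, not compiler code: `step_back_one_byte` and `step_back_one_char` are hypothetical helpers, with plain `usize` byte offsets standing in for `BytePos`.

```rust
/// Hypothetical stand-in for `span.hi() - BytePos(1)`: assumes the token
/// that ends at `end` is exactly one byte long.
fn step_back_one_byte(end: usize) -> usize {
    end - 1
}

/// Boundary-aware alternative: step back over whatever character actually
/// ends at `end`, however many bytes it occupies.
/// `end` must itself lie on a character boundary.
fn step_back_one_char(src: &str, end: usize) -> Option<usize> {
    src[..end].chars().next_back().map(|c| end - c.len_utf8())
}

fn main() {
    // U+037E GREEK QUESTION MARK is recovered as `;` but takes 2 bytes in UTF-8.
    let src = "let x = 1\u{037E}";
    let end = src.len();

    assert_eq!(';'.len_utf8(), 1);
    assert_eq!('\u{037E}'.len_utf8(), 2);

    // The hard-coded offset lands in the middle of the code point,
    // so slicing with `&src[naive..]` would panic.
    let naive = step_back_one_byte(end);
    assert!(!src.is_char_boundary(naive));

    // The boundary-aware version lands on the start of the character.
    let aware = step_back_one_char(src, end).unwrap();
    assert!(src.is_char_boundary(aware));
    assert_eq!(&src[aware..], "\u{037E}");
}
```

The same reasoning applies to `+ BytePos(1)`: the offset is only valid if the character at that position really is one byte wide.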

> sigh we have too many hard coded +1/-1 in the compiler -- @compiler-errors

So it might be worth linting against these error-prone `BytePos` & `Span` manipulations. I don't know how feasible it'd be to implement such a lint well (i.e., with a low false-positive rate) or what the exact rules should look like.

This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.

Uplifted from https://github.com/rust-lang/rust/issues/128717#issuecomment-2273735687:

It's so easy to find these kinds of ICEs:

1. Look for `/(\+|-) BytePos\(\d+\)/` inside `compiler/`.
2. Figure out which ASCII character is meant.
3. Open one's favorite Unicode table website or a program that can list Unicode-confusables.
4. Pick a confusable whose `.len_utf8()` is >1 and :boom:
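
The search step can be approximated without a regex engine. The sketch below is a purely textual, std-only scan, not the proposed lint: the function names and the sample lines are made up for illustration, and a real internal lint would presumably work on the compiler's own representation rather than on source text.

```rust
/// Returns the 1-based numbers of the lines that contain
/// `+ BytePos(<digits>)` or `- BytePos(<digits>)`,
/// mirroring the regex /(\+|-) BytePos\(\d+\)/.
fn find_hardcoded_offsets(source: &str) -> Vec<usize> {
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| line_has_hardcoded_offset(line))
        .map(|(i, _)| i + 1)
        .collect()
}

/// True if `line` adds or subtracts a `BytePos` built from a literal number.
fn line_has_hardcoded_offset(line: &str) -> bool {
    ["+ BytePos(", "- BytePos("].iter().any(|needle| {
        line.match_indices(needle).any(|(start, _)| {
            let rest = &line[start + needle.len()..];
            // Count the ASCII digits right after the opening parenthesis.
            let digits = rest.bytes().take_while(u8::is_ascii_digit).count();
            digits > 0 && rest[digits..].starts_with(')')
        })
    })
}

fn main() {
    // Made-up sample lines; only the first two use a literal offset.
    let sample = "\
let lo = span.hi() - BytePos(1);
let sp = span.with_hi(span.lo() + BytePos(2));
let ok = sm.next_point(span);
let n = pos + BytePos(len);";
    assert_eq!(find_hardcoded_offsets(sample), vec![1, 2]);
}
```

A scan like this misses offsets that are computed or bound to a variable first, which is part of why a low false-positive (and false-negative) rate is hard to reach.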

Example ICE: #128717.

Example ICE: I just found this a minute ago while reviewing an unrelated PR:

This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.

fn f() {}

fn main() {
    f(0,1❩;
}

Leads to:

thread 'rustc' panicked at compiler/rustc_span/src/lib.rs:2119:17:
assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32

Code in question:

https://github.com/rust-lang/rust/blob/c9687a95a602091777e28703aa5abf20f1ce1797/compiler/rustc_hir_typeck/src/fn_ctxt/checks.rs#L1144

Discussions

https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Internal.20lint.20for.20.22non-Unicode-aware.22.20BytePos.20manipulations.3F

Related issues

jieyouxu commented 2 months ago

As for suggestions:

- Suggestions should try to derive spans from existing spans using codepoint-boundary-aware operations available on `Span`, like `Span::{to, between, until}` etc.
- If suggestions really need to manipulate by codepoint granularity, consider using codepoint-boundary-aware helpers on `SourceMap`: `SourceMap::{start_point, end_point, next_point, span_through_char, span_until_whitespace, span_until_non_whitespace, span_until_char, find_width_of_character_at_span}`.
- If a suggestion really, really cannot get away with constructing the desired span from previous methods, then the `BytePos` manipulations need to carefully consider if a parser recovery scheme can reach the suggestion, or in the case of string literals and macros, can the user provide input that would cause codepoint boundary issues.
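
To illustrate what "codepoint-boundary-aware" means in practice, here is a std-only sketch on plain `&str` and `usize` offsets. `char_width_at` and `next_char_range` are hypothetical helpers written for this example; they are not the `SourceMap` methods listed above and do not reproduce their signatures.

```rust
/// Byte width of the character starting at `pos`, or `None` if `pos` is out
/// of range, at the end of `src`, or not on a character boundary.
fn char_width_at(src: &str, pos: usize) -> Option<usize> {
    src.get(pos..)?.chars().next().map(char::len_utf8)
}

/// Byte range `(start, end)` of the character that follows the one at `pos`.
fn next_char_range(src: &str, pos: usize) -> Option<(usize, usize)> {
    let start = pos + char_width_at(src, pos)?;
    let width = char_width_at(src, start)?;
    Some((start, start + width))
}

fn main() {
    // U+2769 MEDIUM RIGHT PARENTHESIS ORNAMENT takes 3 bytes in UTF-8.
    let src = "f(0,1\u{2769};";

    // The confusable starts at byte 5 and is 3 bytes wide, not 1.
    assert_eq!(char_width_at(src, 5), Some(3));
    // Byte 6 is inside the code point, so there is no character there.
    assert_eq!(char_width_at(src, 6), None);
    // The character after it is the `;` at bytes 8..9.
    assert_eq!(next_char_range(src, 5), Some((8, 9)));
    assert_eq!(&src[8..9], ";");
}
```

Asking the source text how wide a character is, instead of assuming a width of 1, is what keeps the offsets on valid boundaries.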

compiler-errors commented 2 months ago

Alternatively, if suggestions often find that a span needs to be recreated to match some subset of the AST (e.g., the span of a keyword like `async` needs to be tracked for suggestions downstream), consider modifying the AST directly to track that span. But please only do this if it's generally useful, and not just needed for a single suggestion, and also be sure to audit the code for existing suggestions to simplify.

rust-lang / rust

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790
