Open fmease opened 3 months ago
As for suggestions:

- `Span`, like `Span::{to, between, until}` etc.
- `SourceMap`: `SourceMap::{start_point, end_point, next_point, span_through_char, span_until_whitespace, span_until_non_whitespace, span_until_char, find_width_of_character_at_span}`.
- `BytePos` manipulations need to carefully consider whether a parser recovery scheme can reach the suggestion or, in the case of string literals and macros, whether the user can provide input that would cause codepoint boundary issues.

Alternatively, if suggestions often find that a span needs to be recreated to match some subset of the AST (e.g. a keyword's span, like that of `async`, needs to be tracked for suggestions downstream), consider modifying the AST directly to track that span. But please only do this if it's generally useful, and not just needed for a single suggestion, and also be sure to audit the code for existing suggestions to simplify.
Inspired by https://github.com/rust-lang/rust/issues/128717#issuecomment-2270315205. CC @jieyouxu.
Since we recover from lexically invalid tokens that are Unicode-confusable with lexically valid tokens (e.g., U+037E Greek Question Mark → U+003B Semicolon; U+066B Arabic Decimal Separator → U+002C Comma), (suggestion) diagnostic code down the line generally ought not to make too many assumptions about the length and/or position in bytes that the `Span` of a supposed Rust token/lexeme "maps to".

In reality, however, (suggestion) diagnostic code all too often doesn't follow this 'rule' when performing low-level "span manipulations", defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like `- BytePos(1)` or `+ BytePos(1)` where the code guesses that the code point before/after corresponds to a certain token/lexeme like `,`, `;`, `)`. However, such code doesn't account for the aforesaid recovery, which may have mapped a code point whose UTF-8 encoding is longer than 1 byte to an ASCII character of length 1. This can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).

So it might be worth linting against these error-prone `BytePos` & `Span` manipulations. I don't know how feasible it'd be to implement such a lint well (i.e., with a low false positive rate) or what the exact rules should look like.

This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.
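To make the failure mode concrete, here is a minimal sketch that uses a plain `&str` as a stand-in for source text rather than rustc's actual `Span`/`BytePos`/`SourceMap` types. Both function names are hypothetical; they only illustrate the difference between guessing a 1-byte width and asking the text itself where its last character starts.

```rust
/// The error-prone guess: assume the last character of `snippet` is exactly
/// one byte wide, mirroring a hard-coded `hi - BytePos(1)`.
/// (Hypothetical helper, not part of rustc.)
fn naive_last_char_start(snippet: &str) -> Option<usize> {
    snippet.len().checked_sub(1)
}

/// The boundary-aware version: ask the string where its last character
/// actually starts, whatever its UTF-8 width is.
/// (Hypothetical helper, not part of rustc.)
fn last_char_start(snippet: &str) -> Option<usize> {
    snippet.char_indices().next_back().map(|(idx, _)| idx)
}

/// Returns the text before the last character without ever slicing at a
/// non-char boundary. Returns `None` for an empty snippet.
fn without_last_char(snippet: &str) -> Option<&str> {
    last_char_start(snippet).map(|idx| &snippet[..idx])
}
```

For `"foo(a, b)"` both offsets agree. For `"foo(a, b\u{2769}"`, where the final character is U+2769 (three bytes in UTF-8), the naive offset lands inside the character, so slicing at it would panic, while the boundary-aware offset is always safe to slice at.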
Uplifted from https://github.com/rust-lang/rust/issues/128717#issuecomment-2273735687:
It's so easy to find these kinds of ICEs: search for `/(\+|-) BytePos\(\d+\)/` inside `compiler/`, use a confusable character whose `.len_utf8()` is >1 and :boom:

Example ICE: #128717.
Example ICE: I just found this a minute ago while reviewing an unrelated PR:
This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.
Leads to:
Code in question:
https://github.com/rust-lang/rust/blob/c9687a95a602091777e28703aa5abf20f1ce1797/compiler/rustc_hir_typeck/src/fn_ctxt/checks.rs#L1144
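For anyone who wants to script the search quoted above without pulling in a regex engine, the following std-only sketch approximates the `/(\+|-) BytePos\(\d+\)/` pattern on a single line of text. The function name is made up for illustration, it only recognizes ASCII digits, and it is an approximation of the regex, not a proposed lint.

```rust
/// Returns `true` if `line` contains `+ BytePos(<digits>)` or
/// `- BytePos(<digits>)`, i.e. a hard-coded byte offset.
/// (Hypothetical helper for illustration only.)
fn has_hardcoded_bytepos_offset(line: &str) -> bool {
    const NEEDLE: &str = " BytePos(";
    line.match_indices(NEEDLE).any(|(idx, _)| {
        // The character right before the space must be `+` or `-`.
        let before = &line[..idx];
        let has_sign = before.ends_with('+') || before.ends_with('-');
        // After the opening parenthesis: one or more ASCII digits, then `)`.
        let rest = &line[idx + NEEDLE.len()..];
        let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
        has_sign && digits > 0 && rest[digits..].starts_with(')')
    })
}
```

A line such as `span.with_hi(span.hi() - BytePos(1))` is flagged, while `lo + BytePos(len)` is not, since its offset is computed and not hard-coded. A match is only a candidate for review: whether the offset is actually wrong depends on whether recovery can reach that code path.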
Discussions
Related issues