rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Inconsistent whitespace definitions in string literals and language itself #60209

Open matklad opened 5 years ago

matklad commented 5 years ago

The lexer uses the Pattern_White_Space Unicode property when skipping over trivia. However, when we process string literals with escaped newlines, we skip only ASCII whitespace:

https://github.com/rust-lang/rust/blob/fe0a415b4ba3310c2263f07e0253e2434310299c/src/libsyntax/parse/mod.rs#L379
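To make the mismatch concrete, here is a minimal sketch of the two whitespace classes side by side. The first function lists the code points that carry the Unicode Pattern_White_Space property. The second is an assumption about what "ASCII whitespace" means in the string-literal path (space, tab, LF, CR). Both function names are hypothetical and are not taken from the compiler.

```rust
/// Code points with the Unicode `Pattern_White_Space` property.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, LF, VT, FF, CR
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}

/// Assumed ASCII-only set skipped after an escaped newline in a string
/// literal (hypothetical helper, not the compiler's own function).
fn is_string_continuation_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\n' | '\r')
}

fn main() {
    for &c in &[' ', '\t', '\u{200F}', '\u{2028}'] {
        println!(
            "U+{:04X}: pattern_white_space={}, string_continuation={}",
            c as u32,
            is_pattern_white_space(c),
            is_string_continuation_whitespace(c)
        );
    }
}
```

Under these assumptions, U+200F falls in the first set but not the second, which is the inconsistency described here.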

Here's an example program showing that U+200F is ignored in program text but not in a string literal:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ec59778d31dde69f29f1095aff2c9b66
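The effect on the string literal can also be approximated without putting invisible characters in the source. The sketch below is a simplified stand-in for the literal processing, not the compiler's code: `resolve_continuations` is a hypothetical helper that handles only the backslash-newline continuation and assumes the skipped set is space, tab, LF and CR.

```rust
/// Hypothetical helper: resolves `\<newline>` continuations, skipping only
/// ASCII whitespace after the escaped newline. Other escapes are left as-is.
fn resolve_continuations(raw: &str) -> String {
    let mut out = String::new();
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' && chars.peek() == Some(&'\n') {
            // Drop the newline and any ASCII whitespace that follows it.
            while let Some(&next) = chars.peek() {
                if matches!(next, ' ' | '\t' | '\n' | '\r') {
                    chars.next();
                } else {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // Raw literal body: backslash, newline, three RLMs, "hello", newline.
    let raw = "\\\n\u{200f}\u{200f}\u{200f}hello\n";
    let value = resolve_continuations(raw);
    // The RLM characters survive because they are not ASCII whitespace.
    println!("{:?}", value);
    assert!(value.starts_with('\u{200f}'));
}
```

With ASCII spaces in place of the U+200F characters, the same helper would strip the indentation entirely.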

Here's the text of the program in Debug format, to make the whitespace slightly more visible:

"fn main() {\n\u{200f}\u{200f}\u{200f}\n    let s = \"\\\n\u{200f}\u{200f}\u{200f}hello\n\";\n    println!(\"{:?}\", s);\n}    \n"

Centril commented 5 years ago

Tentatively classifying as a bug.

matklad commented 5 years ago

I am pretty ignorant about Unicode, but I would prefer to fix this the other way around, by restricting the whitespace definition in the reference to ASCII. Opened https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876 for that discussion.

estebank commented 5 years ago

cc @Manishearth

Manishearth commented 5 years ago

U+200F is definitely useful in text; we should not be skipping it at all in strings.

As for the lexer: this was one of the questions we punted for later in the non-ASCII identifier story. RLM is useful for making code that uses RTL scripts render well, so having it as allowed whitespace is somewhat useful (if confusing).