llvm / llvm-project


[Clang] `\n\r` newlines #101513

Open MitalAshok opened 3 months ago

MitalAshok commented 3 months ago

Previous discussion about this can be found here: https://github.com/llvm/llvm-project/pull/97585/files#r1674111174

Clang currently accepts `\n\r` as a single new-line. I.e., a `\` followed by the two characters `\n\r` deletes all three of those characters in the physical source to create one logical source line.

[CWG2639](https://cplusplus.github.io/CWG/issues/2639.html) seems to make it so `\r\n` -> `\n` and `\r` (not followed by `\n`) -> `\n`, but Clang also converts `\n\r` -> `\n`. This seems nonconformant in C++23.

GCC does not treat `\n\r` as a single new-line. MSVC *does* seem to treat `\n\r` as a single new-line.
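
To make the two readings concrete, here is a minimal Python sketch of the new-line mapping described above. It is only a model of what this issue describes, not code from any compiler; the function name `normalize_newlines` and the `nr_is_single` flag are made up for illustration.

```python
def normalize_newlines(src: bytes, nr_is_single: bool = False) -> bytes:
    # Hypothetical helper: map each physical new-line sequence to one LF.
    # nr_is_single=False: only CR LF and a lone CR collapse to LF
    #   (the reading of CWG2639 given above).
    # nr_is_single=True: LF CR is additionally treated as one new-line
    #   (the behavior reported here for Clang and MSVC).
    out = bytearray()
    i = 0
    while i < len(src):
        two = src[i:i + 2]
        if two == b"\r\n" or (nr_is_single and two == b"\n\r"):
            out += b"\n"
            i += 2
        elif src[i:i + 1] in (b"\r", b"\n"):
            out += b"\n"
            i += 1
        else:
            out.append(src[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    for sample in (b"a\r\nb", b"a\rb", b"a\n\rb"):
        print(sample, normalize_newlines(sample), normalize_newlines(sample, True))
```

The two modes only disagree on `\n\r`: the first yields two new-lines, the second yields one.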

shafik commented 3 months ago

Can you add a godbolt link for each example to clarify the behavior? I am not sure I follow the whole issue correctly.

shafik commented 3 months ago

CC @cor3ntin @tahonermann

MitalAshok commented 3 months ago

I could not get \r carriage-returns to work on godbolt (they are just replaced with \n), so I can't show it there.

I'll rewrite the example in the original thread:

```cpp
int main() { // this line ends with \n\r\
 return 1;
}
```

Generated with:

```shell
python3 -c 'open("test.cpp", "wb").write(b"int main() { // this line ends with \\n\\r\\\n\r return 1;\n}\n")'
```

Clang and MSVC treat the return 1; as part of the comment on a single line (so it returns 0) and GCC doesn't (so it returns 1).

This is one of the few cases where it matters whether a sequence counts as one new-line or several.

(Another case is https://cplusplus.github.io/CWG/issues/1709.html / cebac48bf7e52e352b8cda806a64dab66df4c64f for how many `\n` strings are produced when stringizing a raw string)
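
For anyone who wants to see why the `test.cpp` bytes diverge without having all three compilers at hand, here is a toy Python model of the two steps involved, line splicing followed by `//` comment removal. It is a simplification made up for this thread (it ignores string literals and block comments, and `visible_code` is a hypothetical name), not how any of the compilers is implemented.

```python
import re

# Same bytes as the test.cpp generated above.
SAMPLE = b"int main() { // this line ends with \\n\\r\\\n\r return 1;\n}\n"


def visible_code(src: bytes, nr_is_single: bool) -> bytes:
    # Step 1: line splicing. Delete a backslash that is immediately
    # followed by one new-line sequence. The flag decides whether
    # LF CR counts as a single new-line sequence.
    newline = rb"\r\n|\n\r|\r|\n" if nr_is_single else rb"\r\n|\r|\n"
    spliced = re.sub(rb"\\(?:" + newline + rb")", b"", src)
    # Step 2: drop each // comment up to the next CR or LF.
    return re.sub(rb"//[^\r\n]*", b"", spliced)


if __name__ == "__main__":
    print(b"return 1;" in visible_code(SAMPLE, nr_is_single=False))
    print(b"return 1;" in visible_code(SAMPLE, nr_is_single=True))
```

With `nr_is_single=False` only the `\` and the LF are spliced away, so the remaining CR ends the comment and `return 1;` survives, which matches what is reported for GCC. With `nr_is_single=True` the CR is swallowed too and `return 1;` ends up inside the comment, matching Clang and MSVC.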

MitalAshok commented 3 months ago

There looks to be a similar bug in raw string literal parsing:

```cpp
constexpr const char* s = R"(
)";
```

With the newline = `\n`:

| Compiler | `s[0]` | `s[1]` |
| --- | --- | --- |
| Clang | `'\n'` | `0` |
| GCC | `'\n'` | `0` |
| MSVC | `'\n'` | `0` |

With the newline = `\r\n`:

| Compiler | `s[0]` | `s[1]` |
| --- | --- | --- |
| Clang | `'\n'` | `0` |
| GCC | `'\n'` | `0` |
| MSVC | `'\n'` | `0` |

With the newline = `\r`:

| Compiler | `s[0]` | `s[1]` |
| --- | --- | --- |
| Clang | `'\r'` | `0` |
| GCC | `'\n'` | `0` |
| MSVC | `'\n'` | `0` |

(MSVC also has "warning C4335: Mac file format detected: please convert the source file to either DOS or UNIX format")

With the newline = `\n\r`:

| Compiler | `s[0]` | `s[1]` | `s[2]` |
| --- | --- | --- | --- |
| Clang | `'\n'` | `'\r'` | `0` |
| GCC | `'\n'` | `'\n'` | `0` |
| MSVC | `'\n'` | `0` | |

GCC seems to have the correct behaviour in all four cases.
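
Since godbolt replaces `\r`, the four inputs have to be produced locally. Here is a small Python sketch that writes one copy of the raw string example per new-line sequence. The file names (`raw_lf.cpp` and so on) and the helper name `write_raw_string_tests` are arbitrary choices for illustration; each file still has to be compiled and `s` inspected separately.

```python
import tempfile
from pathlib import Path

# The four new-line sequences from the tables above (labels are arbitrary).
NEWLINES = {"lf": b"\n", "crlf": b"\r\n", "cr": b"\r", "lfcr": b"\n\r"}


def write_raw_string_tests(out_dir: Path) -> dict:
    # Write one .cpp file per new-line sequence. Only the new-line inside
    # the raw string varies; the line after it always ends in a plain LF.
    paths = {}
    for name, nl in NEWLINES.items():
        path = out_dir / f"raw_{name}.cpp"
        path.write_bytes(b'constexpr const char* s = R"(' + nl + b')";\n')
        paths[name] = path
    return paths


if __name__ == "__main__":
    for name, path in write_raw_string_tests(Path(tempfile.mkdtemp())).items():
        print(name, path, path.read_bytes())
```

Writing with `write_bytes` avoids any new-line translation by Python, so the bytes on disk are exactly the ones listed in `NEWLINES`.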

shafik commented 3 months ago

@AaronBallman points out this piece of code: https://github.com/llvm/llvm-project/blob/e46468407a7bb7f8b2fe13675a5a1c32b85f8cad/clang/lib/Lex/Lexer.cpp#L1288-L1293