llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[clang] Invalid character/string literals accepted in C++ during preprocessing after P2621R2. #107191

Open keinflue opened 1 week ago

keinflue commented 1 week ago

After P2621R2, which is a defect report, the following program is ill-formed in C++ (UB beforehand):

#define X '\N'
int main() {}

This is ill-formed already in translation phase 3 when lexing into preprocessing tokens, because \N not followed by { can't be a named-universal-character and \N also can't begin any escape-sequence. Therefore '\N' can't be a character-literal and ' will form a single-character preprocessing token by itself. [lex.pptoken]/2 makes this ill-formed (UB before P2621R2).

The C++ status page claims that P2621R2 is already supported, but Clang compiles this without a diagnostic (https://godbolt.org/z/WxvcfPj8a).

The same happens with other invalid escape sequences and string literals as well.
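
The rule behind this report can be sketched as a small check on the characters that follow a backslash inside a literal. This is only an illustration, not Clang's lexer: the names `can_begin_escape`, `is_hex`, and `hex_run` are hypothetical, the basic character set is approximated by printable ASCII, and the contents of a braced form are not validated.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if c is a hexadecimal digit.
constexpr bool is_hex(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') ||
           (c >= 'A' && c <= 'F');
}

// Hypothetical helper: true if s[first, first + count) exists and is all hex digits.
constexpr bool hex_run(std::string_view s, std::size_t first, std::size_t count) {
    if (s.size() < first + count) return false;
    for (std::size_t i = first; i < first + count; ++i)
        if (!is_hex(s[i])) return false;
    return true;
}

// Hypothetical helper: given the text right after a backslash, report whether
// it can begin an escape-sequence or a universal-character-name.
constexpr bool can_begin_escape(std::string_view s) {
    if (s.empty()) return false;
    const char c = s[0];
    const bool brace_next = s.size() > 1 && s[1] == '{';
    switch (c) {
    case 'N': return brace_next;                     // named-universal-character needs '{'
    case 'o': return brace_next;                     // delimited octal needs '{'
    case 'x': return brace_next || hex_run(s, 1, 1); // hex escape needs a digit or '{'
    case 'u': return brace_next || hex_run(s, 1, 4); // hex-quad or '{'
    case 'U': return hex_run(s, 1, 8);               // two hex-quads
    default:
        // Simple escapes, octal digits, and conditional-escape-sequences.
        // Assumption: printable ASCII stands in for the basic character set.
        return c >= 0x20 && c <= 0x7E;
    }
}

// The case from this issue: N followed by a closing quote is not an escape.
static_assert(!can_begin_escape("N'"));
static_assert(can_begin_escape("N{DIGIT ONE}'"));
static_assert(can_begin_escape("n'"));
static_assert(!can_begin_escape("x'"));
```

When the check fails, the backslash cannot be part of any c-char or s-char, so the text cannot be lexed as a literal and the opening quote is left as a lone preprocessing token, which is the ill-formed case the issue describes.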

llvmbot commented 1 week ago

@llvm/issue-subscribers-clang-frontend

Author: None (keinflue)

After [P2621R2](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2621r2.pdf), which is a defect report, the following program is ill-formed in C++ (UB beforehand):

```
#define X '\N'
int main() {}
```

This is ill-formed already in translation phase 3 when lexing into preprocessing tokens, because `\N` not followed by `{` can't be a _named-universal-character_ and `\N` also can't begin any _escape-sequence_. Therefore `'\N'` can't be a _character-literal_ and `'` will form a single-character preprocessing token by itself. [[lex.pptoken]/2](https://eel.is/c++draft/lex#pptoken-2.sentence-4) makes this ill-formed (UB before P2621R2).

The C++ status page claims that P2621R2 is already supported, but Clang compiles this without diagnostic (https://godbolt.org/z/WxvcfPj8a).

The same happens with other invalid escape sequences and string literals as well.

shafik commented 1 week ago

maybe dup: https://github.com/llvm/llvm-project/issues/97741

CC @cor3ntin

zygoloid commented 1 week ago

I think this is distinct from #97741 -- this is incorrect acceptance of invalid character and string literal tokens whose end cannot be determined due to invalid escape sequences, whereas #97741 is about failing to reject well-formed but invalid UCNs in a string literal that can be tokenized.
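
One way to picture the second category is a universal-character-name that matches the grammar, so the literal can still be tokenized, but whose value is not a Unicode scalar value. The sketch below assumes that reading; #97741 may cover other cases, and `UcnStatus` and `check_ucn_value` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical classification of a grammatically well-formed UCN by its value.
enum class UcnStatus { Valid, NotScalarValue };

// A Unicode scalar value is any code point up to 0x10FFFF that is not a
// surrogate (0xD800 through 0xDFFF).
constexpr UcnStatus check_ucn_value(std::uint32_t code_point) {
    const bool surrogate = code_point >= 0xD800 && code_point <= 0xDFFF;
    const bool too_large = code_point > 0x10FFFF;
    return (surrogate || too_large) ? UcnStatus::NotScalarValue
                                    : UcnStatus::Valid;
}

static_assert(check_ucn_value(0x00E9) == UcnStatus::Valid);
static_assert(check_ucn_value(0xD800) == UcnStatus::NotScalarValue);
static_assert(check_ucn_value(0x110000) == UcnStatus::NotScalarValue);
```

In this reading, the literal's end is never in doubt and only the designated value is rejected, whereas in the present issue the token boundary itself cannot be found.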