llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Wrong treatment of unicode named characters with lowercase letters #109555

Open mrolle45 opened 1 month ago

mrolle45 commented 1 month ago

When compiling a C++ file as C++23, which introduces named universal character escapes, I get spurious errors and warnings for some names that are actually valid. The issue is lowercase letters in the name. This is legal, and the name is case-insensitive. For reference, see https://en.wikipedia.org/wiki/Unicode_character_property and https://scripts.sil.org/cms/scripts/page.php?id=unicodenames&site_id=nrsi. For example:

\N{greek SMALL LETTER DELTA}

This results in two messages:

error: 'greek SMALL LETTER DELTA' is not a valid Unicode character name
note: characters names in Unicode escape sequences are sensitive to case and whitespaces

Clang should do a case-folded lookup on the given name. If that lookup fails, the error message is appropriate, along with a note about whichever problems are actually present in the name: disallowed characters, leading, trailing, or repeated spaces, repeated hyphens, etc. Any violation of these rules makes the lookup fail anyway, so the lookup can be done first.
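
To make the proposed order concrete (fold case, look up, and only on failure report what is wrong with the spelling), here is a minimal sketch. It is not Clang's implementation: `kNameTable`, `foldCase`, `lookupCaseFolded`, and `describeProblems` are hypothetical names, and the two-entry table merely stands in for the real Unicode name database.

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical two-entry table standing in for the full Unicode name database.
static const std::unordered_map<std::string, char32_t> kNameTable = {
    {"GREEK SMALL LETTER DELTA", char32_t{0x03B4}},
    {"LATIN SMALL LETTER A", char32_t{0x0061}},
};

// Upper-case ASCII letters; everything else is copied unchanged.
std::string foldCase(std::string_view name) {
  std::string out(name);
  for (char &c : out)
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  return out;
}

// Step 1: case-folded lookup. Returns the code point, or nullopt on failure.
std::optional<char32_t> lookupCaseFolded(std::string_view name) {
  auto it = kNameTable.find(foldCase(name));
  if (it == kNameTable.end())
    return std::nullopt;
  return it->second;
}

// Step 2 (only after a failed lookup): list the problems actually present
// in the spelling, so the note can mention just those.
std::vector<std::string> describeProblems(std::string_view name) {
  std::vector<std::string> notes;
  if (!name.empty() && name.front() == ' ')
    notes.push_back("leading space");
  if (!name.empty() && name.back() == ' ')
    notes.push_back("trailing space");
  if (name.find("  ") != std::string_view::npos)
    notes.push_back("multiple consecutive spaces");
  if (name.find("--") != std::string_view::npos)
    notes.push_back("multiple consecutive hyphens");
  for (unsigned char c : name) {
    if (!std::isalnum(c) && c != ' ' && c != '-') {
      notes.push_back("disallowed character");
      break;
    }
  }
  return notes;
}
```

With this sketch, `lookupCaseFolded("greek SMALL LETTER DELTA")` finds U+03B4, while a name with a doubled space fails the lookup and `describeProblems` reports only that one problem.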

cor3ntin commented 1 month ago

The C++ standard requires an exact match, so the behavior you are seeing is conforming. We could presumably demote the error to a warning and make it a conforming extension.
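
As a rough illustration of what such an extension might look like (an assumption, not a description of Clang's code), the sketch below tries the exact match first and falls back to a case-insensitive match that is reported as a warning rather than an error. `Diag`, `NameResult`, `findExact`, and `resolveName` are hypothetical names, and the table is a two-entry stand-in for the real name database.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical diagnostic levels for this sketch.
enum class Diag { None, WarnInexactName, ErrorUnknownName };

struct NameResult {
  std::optional<char32_t> codePoint;
  Diag diag;
};

// Exact, case-sensitive lookup in a tiny stand-in table.
std::optional<char32_t> findExact(std::string_view name) {
  static const std::pair<std::string_view, char32_t> table[] = {
      {"GREEK SMALL LETTER DELTA", char32_t{0x03B4}},
      {"LATIN SMALL LETTER A", char32_t{0x0061}},
  };
  for (const auto &[entryName, cp] : table)
    if (entryName == name)
      return cp;
  return std::nullopt;
}

NameResult resolveName(std::string_view name) {
  // Conforming path: the spelling matches exactly, no diagnostic.
  if (auto cp = findExact(name))
    return {cp, Diag::None};

  // Extension path: accept a name that differs only in case, but warn.
  std::string upper(name);
  std::transform(upper.begin(), upper.end(), upper.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  if (auto cp = findExact(upper))
    return {cp, Diag::WarnInexactName};

  // Neither lookup succeeded: keep the hard error.
  return {std::nullopt, Diag::ErrorUnknownName};
}
```

Under these assumptions, `resolveName("greek SMALL LETTER DELTA")` yields U+03B4 with `Diag::WarnInexactName`, while the exactly spelled name produces no diagnostic and an unknown name still errors.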