llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

-fdollars-in-identifiers support lagging behind gcc/tinycc #95740

Open pskocik opened 4 weeks ago

pskocik commented 4 weeks ago

This is probably an extension, but gcc, tinycc, and clang all tolerate a transient weird token during preprocessor token concatenation, e.g.,

#define CAT(A,B) CAT_(A,B)
#define CAT_(A,B) A ## B
int CAT(X,CAT(1,_)); //✓ on clang, gcc, and tinycc

works on all three even though it transiently creates the weird token 1_.
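To see what the nested paste actually produces, the result can be stringified. This is only a minimal sketch: `STR`, `STR_`, `pasted_name`, and `pasted_value` are hypothetical helpers added for illustration and are not part of the original report.

```c
#include <string.h>

#define CAT(A,B) CAT_(A,B)
#define CAT_(A,B) A ## B

/* Hypothetical helpers: expand the argument first, then stringify it. */
#define STR(T) STR_(T)
#define STR_(T) #T

/* The inner CAT(1,_) pastes 1 and _ into the transient token 1_;
   the outer paste then joins X and 1_ into the identifier X1_. */
int CAT(X,CAT(1,_)) = 42;

/* Returns the spelling of the final pasted token. */
const char *pasted_name(void) { return STR(CAT(X,CAT(1,_))); }

/* Refers to the variable by its directly written name. */
int pasted_value(void) { return X1_; }
```

Calling `pasted_name()` should yield the string `X1_`, and `pasted_value()` reads the same variable through its plain spelling.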

On clang, unlike the other two, this does not work with $ in place of _, or with any suffix containing $.

int CAT(X,CAT(1,$));  //✗ on clang; ✓ on gcc and tinycc

Could be worth fixing. It's kind of a weird inconsistency, even within clang itself.

https://godbolt.org/z/YGzd1h1Gc

llvmbot commented 4 weeks ago

@llvm/issue-subscribers-clang-frontend

Sirraide commented 4 weeks ago

Looks like the reason it works with _ is because 1_ is lexed as an integer literal w/ a (UDL) suffix of _: https://godbolt.org/z/echxb68W6.

cor3ntin commented 4 weeks ago

1foo is a valid pp-number in both C and C++, regardless of UDLs (basically, anything that starts with a digit is a pp-number: https://eel.is/c++draft/lex.ppnumber#nt:pp-number).

This lets the preprocessor not care too much about parsing numbers (and some of these pp-numbers can indeed end up being valid UDLs).

Note that 1$ is not valid outside of the preprocessor, but X1$ ought to be (in modes where $ is valid in identifiers).
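The pp-number rule described above can be sketched as a small checker. This is a simplified illustration, not any compiler's lexer: it handles ASCII only and ignores universal character names and digit separators. The function name `is_pp_number` and the `allow_dollar` switch are hypothetical; the switch models the open question of whether `$` counts as an identifier character inside a pp-number.

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if the whole string s matches a simplified pp-number grammar.
   allow_dollar (hypothetical) treats '$' like an identifier character. */
bool is_pp_number(const char *s, bool allow_dollar)
{
    const unsigned char *p = (const unsigned char *)s;

    /* A pp-number starts with a digit, or with '.' followed by a digit. */
    if (*p == '.')
        p++;
    if (!isdigit(*p))
        return false;
    p++;

    while (*p != '\0') {
        if (*p == '+' || *p == '-') {
            /* A sign may only directly follow e, E, p, or P. */
            unsigned char prev = p[-1];
            if (prev != 'e' && prev != 'E' && prev != 'p' && prev != 'P')
                return false;
        } else if (!(isalnum(*p) || *p == '_' || *p == '.' ||
                     (allow_dollar && *p == '$'))) {
            /* Anything else ends the pp-number. */
            return false;
        }
        p++;
    }
    return true;
}
```

With this sketch, `1_` and `1foo` are accepted regardless of the switch, `1$` is accepted only when `allow_dollar` is true, and `X1$` is never a pp-number because it does not start with a digit.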

It's also true that we have a bunch of UDL-related bugs: https://godbolt.org/z/WsxncWeYM

That being said, I do believe that encouraging the use of $ in identifiers is not advisable, given the negative impact it has on the ability of C++ to evolve, so I don't know if we want to expend a lot of energy fixing these corner cases involving $.

pskocik commented 4 weeks ago

> Looks like the reason it works with _ is because 1_ is lexed as an integer literal w/ a (UDL) suffix of _: https://godbolt.org/z/echxb68W6.

Looks like it works with any [0-9A-Z_a-z]* suffix, even an empty one (same on other C compilers), but fails when a token is formed that starts with a digit and contains any $, even if a later concatenation makes it no longer start with a digit (and the final token would otherwise be accepted if written directly). https://godbolt.org/z/v715E3n8n
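A rough sketch of the suffixes that do work, using the same CAT macros as in the original report. The variable values and the `suffix_sum` helper are made up for illustration. Each declaration goes through a transient token that starts with a digit before the outer paste turns it into an ordinary identifier.

```c
#define CAT(A,B) CAT_(A,B)
#define CAT_(A,B) A ## B

int CAT(X,CAT(1,)) = 1;      /* empty suffix: transient 1, final X1 */
int CAT(X,CAT(1,_)) = 2;     /* transient 1_, final X1_ */
int CAT(X,CAT(1,abc)) = 3;   /* transient 1abc, final X1abc */
int CAT(X,CAT(1,9Z_q)) = 4;  /* transient 19Z_q, final X19Z_q */

/* Per the report, a suffix containing '$' in place of any of the above
   is rejected by clang but accepted by gcc and tinycc. */

/* Hypothetical helper: uses each variable by its directly written name. */
int suffix_sum(void) { return X1 + X1_ + X1abc + X19Z_q; }
```

If all four declarations are accepted, `suffix_sum()` returns 10, which confirms that each pasted name matches the identifier written out by hand.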

I was testing with (and am interested in) the C frontend, but it looks like it behaves the same with -xc++.