llvm / llvm-project

C: any non-white space character may be a preprocessor token #97836

Open gustedt opened 4 weeks ago

gustedt commented 4 weeks ago

The C standard states in 6.4 p1

preprocessing-token: ... each non-white-space character that cannot be one of the above

That is, during preprocessing all characters have to be accepted, regardless of whether or not they are valid in later compilation phases. But clang, in all versions that I tested, does not accept the following code:

#define stringify(...) #__VA_ARGS__
char A[] = stringify(¬);

Here the offending not sign does not survive preprocessing as a stray token; it is absorbed into a string literal, which is fine.

GCC happily accepts this and produces a string containing this character.
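
For reference, the intended result (and, as far as I can tell, what GCC emits here, modulo whitespace) is the following declaration, with the ¬ carried into the string literal:

char A[] = "¬";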

Thanks Jens

llvmbot commented 4 weeks ago

@llvm/issue-subscribers-clang-frontend

Author: Jens Gustedt (gustedt)

llvmbot commented 4 weeks ago

@llvm/issue-subscribers-c

Author: Jens Gustedt (gustedt)

Sirraide commented 4 weeks ago

CC @AaronBallman

cor3ntin commented 4 weeks ago

There seems to be some divergence between C and C++ here

C

Constraints

2 Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a constant, a string literal, or a punctuator. A single universal character name shall match one of the other preprocessing token categories.

Semantics

3 A token is the minimal lexical element of the language in translation phases 7 and 8 (5.1.1.2). The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators. A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and both single universal character names as well as single non-white-space characters that do not lexically match the other preprocessing token categories.60) If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (described later), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in 6.10, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation.

C++

A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. In this document, glyphs are used to identify elements of the basic character set ([lex.charset]). The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-whitespace characters that do not lexically match the other preprocessing token categories. If a U+0027 APOSTROPHE or a U+0022 QUOTATION MARK character matches the last category, the program is ill-formed. If any character not in the basic character set matches the last category, the program is ill-formed.

This would be unfortunate.
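
To make that concrete (this is just my reading of the two passages above): under the C wording the raw character is a valid "single non-white-space character" preprocessing token and only ever ends up inside a string literal, while under the current C++ wording the same translation unit is ill-formed because ¬ is not in the basic character set:

#define stringify(...) #__VA_ARGS__
char A[] = stringify(¬);   /* C: valid, A holds "¬"; C++: ill-formed */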

gustedt commented 4 weeks ago

This must have been a conscious decision on the C++ side at some point. Looks a bit hostile to me. I really don't get why C++ is turning the screws here, in an area that many claim C++ should abandon anyhow.

In any case, for the case in question, clang is clearly not conforming for C.

Jens

-- Jens Gustedt - INRIA & ICube, Strasbourg, France

AaronBallman commented 3 weeks ago

There seems to be some divergence between C and C++ here

This would be unfortunate.

Agreed; one outcome of this issue should be a paper to either WG21 or WG14 (or both, I suppose) asking for a change to unify behaviors.

Personally, I prefer C's behavior over C++'s behavior in this case. The preprocessor gets used for more than just C and C++ code (both notionally, as in "it could be a separate program" and practically, as in "it's used by things like assemblers and resource compilers"), so it seems to me that the preprocessor should be as language agnostic as plausible so that it continues to be generally useful. The compiler can then issue a diagnostic for any programming language that elects to have more restrictions on what characters can appear in identifiers, strings, comments, etc without impacting other tools relying on the preprocessor.
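
As a concrete (hypothetical) illustration of that last point: assembly files are routinely run through the C preprocessor before being assembled, and a perfectly ordinary ARM .S file contains characters that, to the preprocessor, are nothing more than stray single-character pp-tokens. A sketch, with made-up names:

/* foo.S -- hypothetical ARM assembly, run through the C preprocessor before assembly */
#define RET_REG r0

    .globl  return_zero
return_zero:
    mov     RET_REG, #0   @ the @ begins an assembler comment, but to the
                          @ preprocessor it is just a lone non-white-space
                          @ character forming its own pp-token
    bx      lr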

This must have been a conscious decision on the C++ side at some point.

It was: https://wg21.link/P2314R4

Looks a bit hostile to me.

I doubt it was done maliciously but the paper doesn't give any rationale for the change in behavior in this case. The old wording was:

and single universal-character-names and non-whitespace characters that do not lexically match the other preprocessing token categories. If a single universal-character-name does not match any of the other preprocessing token categories, the program is ill-formed.

so it used to be ill-formed only for invalid UCNs but it was broadened by the paper to include any non-whitespace characters not in the basic character set.

This seems like a good topic for SG22 and SG16, IMO.

zygoloid commented 3 weeks ago

Personally, I prefer C's behavior over C++'s behavior in this case. […]

Relatedly: the rule that pp-tokens formed by token-paste must be lexically valid also causes problems in similar cases. If we want to allow non-token "stuff" to be able to live as a pp-token, then why should we not also allow (for example) e+1 as a single pp-token? Just like the single non-basic-source-character case, that wouldn't be valid if converted from pp-token to token, but could be valid if you stringified it or pasted 0 to the start first. Allowing single-character "garbage" tokens but not multi-character ones seems inconsistent.
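
A small sketch of the paste side of this (the helper macros are just illustrative):

#define CAT_(a, b) a ## b
#define CAT(a, b)  CAT_(a, b)

/* e+1 lexes as the three pp-tokens e, +, 1, and it cannot be formed as a
   single pp-token (pasting e and + would itself be invalid). Pasting a 0
   onto the front therefore only joins 0 with e, yielding the tokens
   0e, +, 1 -- and 0e is not a valid floating constant, so this is
   rejected when pp-tokens are converted to tokens. */
double d = CAT(0, e+1);

/* If e+1 could survive as one "garbage" pp-token, the same paste would
   instead produce the single valid pp-number 0e+1. */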