llvm / llvm-project

Trigraph support in various C language modes #46989

Open AaronBallman opened 3 years ago

AaronBallman commented 3 years ago
Bugzilla Link 47645
Version trunk
OS Windows NT
CC @DougGregor,@hubert-reinterpretcast,@zygoloid

Extended Description

Clang's trigraph support is disabled by default when the language standard is set to GNU C mode. Not supporting trigraphs is perhaps reasonable because the user is specifying that they want something other than standard C.

However, when no -std= option is passed on the command line, we silently default to gnu17 mode which then disables this standard C feature and that seems less reasonable to me because this is a case where we're removing support for a feature rather than adding a conforming extension. The user is saying "here's my C file, please compile it" and I think we should be able to compile conforming C in that case (for some C standard version, but all of them currently support trigraphs).

Confounding matters somewhat, we define __STDC__ to 1 in GNU mode despite trigraph support in C being mandatory.

Should trigraphs remain enabled when no -std is present (even though we would still default to gnu17) so that users can compile conforming C code? Should __STDC__ be paying attention to whether trigraph support is enabled or not?
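The mismatch described above can be observed from inside a program. The following is a minimal sketch and not part of the original report; the function names `stdc_value`, `trigraphs_enabled`, and `report_mode` are hypothetical. It relies on the fact that, when trigraph replacement is active, the three-character sequence for '#' inside a string literal collapses to a single character.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: the value of __STDC__, or 0 if it is not defined. */
int stdc_value(void) {
#ifdef __STDC__
    return __STDC__;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 if trigraph replacement happened in this
   translation unit, 0 otherwise. */
int trigraphs_enabled(void) {
    /* With trigraphs on, the literal below is "#" (2 bytes including the
       terminator); with trigraphs off it keeps all three characters
       (4 bytes including the terminator). */
    const char probe[] = "??=";
    assert(sizeof(probe) == 2 || sizeof(probe) == 4);
    return sizeof(probe) == 2;
}

/* Prints both observations on one line. */
void report_mode(void) {
    printf("__STDC__ = %d, trigraphs %s\n", stdc_value(),
           trigraphs_enabled() ? "enabled" : "disabled");
}
```

Calling `report_mode()` from `main` and building once with `-std=gnu17` and once with `-std=c17` should, going by the description above, show the trigraph field change while `__STDC__` stays at 1. A compiler may also emit a trigraph warning for the probe literal.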

AaronBallman commented 3 years ago

it seems wrong to me that we diagnose a standard construct by default when that construct isn't particularly dangerous.

Trigraphs can be problematic, although the extent of the problem seems to vary between codebases. Specifically, issues arise because they are replaced inside string literals -- at least, that's where we saw the most problems with them, and that was the primary motivation for my pushing to have them removed from C++. Some string templating libraries use ?s as placeholders, and trigraphs are especially problematic in those cases.
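As an illustration of the string-literal hazard, here is a small sketch with a made-up template string `fmt(??)` in which two `?` placeholders sit next to each other. The function names are hypothetical. The first spelling changes meaning when trigraphs are enabled, because the two question marks plus the closing parenthesis form the trigraph for `]`; the other two spellings are the usual ways to keep the text stable in either mode.

```c
#include <assert.h>
#include <string.h>

/* Written naively: with trigraphs enabled this becomes "fmt(]". */
const char *raw_template(void) {
    return "fmt(??)";
}

/* The escaped question mark breaks up the would-be trigraph. */
const char *escaped_template(void) {
    return "fmt(?\?)";
}

/* Adjacent literals are concatenated long after trigraph replacement,
   so splitting between the question marks is also safe. */
const char *split_template(void) {
    return "fmt(?" "?)";
}

/* Returns 1 if the naive spelling survived unchanged, 0 if it was rewritten. */
int raw_template_survived(void) {
    /* The two defensive spellings always agree with each other. */
    assert(strcmp(escaped_template(), split_template()) == 0);
    return strcmp(raw_template(), escaped_template()) == 0;
}
```

In a mode with trigraphs disabled, `raw_template_survived()` is expected to return 1; with them enabled it is expected to return 0.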

Yeah, I figured string literals were about the only place the issue would be bad. But I think it's more defensible to ask people to opt into a diagnostic for accidental use of a feature than it is to ask people to opt out of a diagnostic for standard functionality. There's a bit of a parallel here with -Wvla for accidental VLA usage, which is disabled by default.

But trigraphs are a bit different -- in GNU mode, we don't support them by default (so GNU mode is not a superset of C)

(Well, that's really the point of GNU mode in general, as distinct from GNU extensions -- it's all the things that aren't conforming extensions. GNU C is a dialect of C, not a superset of C.)

Good point!

and in standard C mode, we support them by default but tell the user "you've gone ahead and used a trigraph" which is a bit like telling the user "you used an addition operator" (accurate yet basically useless).

I think there's a big difference between those two cases: there are no reasons to intentionally use trigraphs except to confuse the reader or to avoid encoding issues. Given that the only encoding we support is UTF-8, there is no encoding problem for trigraphs to solve. So if we see a trigraph, either the programmer is trying to be confusing or they made a mistake, and either way that warrants a warning.

The situation I'm most concerned about is a user who has correct code using trigraphs (regardless of why) with another compiler and they try to compile that code with Clang. I don't think that fits into either one of your two cases but is a third, distinct case, because the user may reasonably want their code to compile with both compilers. Trigraphs are part of the C standard and one major point to having a standard is to have some hope of code portability between implementations.

There are non-UTF-8 cases where there are genuine encoding issues that trigraphs affect. For example (second-hand info from talking to folks from IBM, any errors my own): there are multiple EBCDIC code pages, and the encoding of '#' is not consistent across them, but a single compilation will often enter (via #include) files with different encodings. There is a convention to use a ??=pragma to specify the encoding for each such file. If we were to add support for EBCDIC encodings, I would expect whatever flag enables such support would also recognize at least the ??= trigraph (at least at the start of the file) as part of that encoding mechanism.

I find the fact that Clang accepts non-standard keywords by default to be substantially more objectionable. In any case, I think our goal should be to change our default -std= mode from gnu to c. Does that seem reasonable?

I didn't dare to suggest that as an option because I figured it would be a non-starter, but if you think we can convince the community to go this route, I'm all for it. I think that having the compiler default to the standards-conformant mode is a better approach long-term than continuing to have developers silently introduce extensions to their code without noticing.

Do you envision the switch of default standard modes to also encompass whether we want Clang to pretend to be GCC by default as well (for things like _GNU_SOURCE or __GNUC__)? It sort of seems odd that if we don't enable GNU extensions we still enable macros like __GNUC__.

It's important to distinguish between GNU mode (the GNU C and GNU C++ languages) and GNU extensions here. GNU C and GNU C++ are languages much like ISO C and ISO C++, but with some differences; valid programs in GNU C aren't necessarily valid programs in ISO C and vice versa, but there's a very high degree of overlap. (For example, GNU C89 has 'typeof', 'asm', 'inline' as keywords, whereas ISO C89 has trigraphs.) That's what -std=cXY versus -std=gnuXY controls.

GNU extensions, on the other hand, are a set of conforming extensions to C-family languages, that occupy syntactic or semantic space that is either reserved or unused by the underlying language. (For example, attributes, statement expressions, typeof, constant folding of array bounds.) That's what __GNUC__ indicates the presence of.

So, changing the default -std= mode shouldn't change whether we define __GNUC__. (Another way to reach the same conclusion: GCC defines __GNUC__ in -std=cXY modes.)
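The distinction can be made concrete with a short sketch. It assumes a GCC-compatible compiler; the names `gnu_extensions_advertised`, `SWAP`, and `first_after_swap` are hypothetical. The macro uses `__typeof__`, the reserved-identifier spelling of the extension, which is why it is expected to keep working under -std=cXY as well as -std=gnuXY, guarded only by `__GNUC__`.

```c
#include <assert.h>

/* Hypothetical helper: 1 if the compiler advertises GNU extensions. */
int gnu_extensions_advertised(void) {
#ifdef __GNUC__
    return 1;
#else
    return 0;
#endif
}

#ifdef __GNUC__
/* GNU extension in reserved namespace: works for any assignable type. */
#define SWAP(a, b) \
    do { __typeof__(a) swap_tmp_ = (a); (a) = (b); (b) = swap_tmp_; } while (0)
#else
/* Fallback for this sketch only: correct for int operands. */
#define SWAP(a, b) \
    do { int swap_tmp_ = (a); (a) = (b); (b) = swap_tmp_; } while (0)
#endif

/* Swaps its two arguments locally and returns the new first value. */
int first_after_swap(int x, int y) {
    int old_y = y;
    SWAP(x, y);
    assert(x == old_y);
    return x;
}
```

The plain `typeof` keyword, by contrast, is one of the things the -std=gnuXY versus -std=cXY choice affects, so the sketch deliberately avoids it.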

Ah, thank you for that explanation.

_GNU_SOURCE and __STRICT_ANSI__ add more complication:

  • __STRICT_ANSI__ is defined when we're not in GNU mode (nor -fms-compatibility), and it disables C standard library extensions. However,
  • _GNU_SOURCE is defined when we're in any C++ mode (regardless of GNU mode), and enables (or with __STRICT_ANSI__, re-enables) all glibc extensions; both libstdc++ and libc++ have at one time or another relied on at least some of these extensions, but I don't know if they still do

So changing the default from -std=gnu to -std=c would have no effect on the standard library interface made available in C++ (it would remain wrong, due to _GNU_SOURCE). But it would disable all standard library extensions in C. (Notably, those extensions are generally non-conforming.)
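A small probe can show which of the two macros a given build actually sees. This is only a sketch with hypothetical function names; it reports what the preprocessor defines and makes no claim about which library declarations end up visible.

```c
#include <assert.h>
#include <stdio.h>

/* 1 if __STRICT_ANSI__ is defined for this translation unit. */
int strict_ansi_defined(void) {
#ifdef __STRICT_ANSI__
    return 1;
#else
    return 0;
#endif
}

/* 1 if _GNU_SOURCE is defined for this translation unit. */
int gnu_source_defined(void) {
#ifdef _GNU_SOURCE
    return 1;
#else
    return 0;
#endif
}

/* Packs both observations: bit 0 is __STRICT_ANSI__, bit 1 is _GNU_SOURCE. */
int library_macro_mask(void) {
    int mask = strict_ansi_defined() | (gnu_source_defined() << 1);
    assert(mask >= 0 && mask <= 3);
    return mask;
}

/* Prints both observations on one line. */
void report_library_macros(void) {
    printf("__STRICT_ANSI__ %s, _GNU_SOURCE %s\n",
           strict_ansi_defined() ? "defined" : "not defined",
           gnu_source_defined() ? "defined" : "not defined");
}
```

Going by the two bullets above, a C build with -std=cXY would be expected to report `__STRICT_ANSI__` as defined, and a C++ build to report `_GNU_SOURCE` as defined regardless of the -std= family.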

That does add complications. :-(

hubert-reinterpretcast commented 3 years ago

Well, the implementation-defined portion of phase 1 of translation permits us to map trigraphs to alternative character sequences (this is the same "we don't need trigraphs to be in the standard to accept them" argument that was made to justify their removal from C++, but in reverse).

Agreed, the phase one mapping is allowed to insert backslash-newline to escape trigraphs.

ec04fc15-fa35-46f2-80e1-5d271f2ef708 commented 3 years ago

it seems wrong to me that we diagnose a standard construct by default when that construct isn't particularly dangerous.

Trigraphs can be problematic, although the extent of the problem seems to vary between codebases. Specifically, issues arise because they are replaced inside string literals -- at least, that's where we saw the most problems with them, and that was the primary motivation for my pushing to have them removed from C++. Some string templating libraries use ?s as placeholders, and trigraphs are especially problematic in those cases.

But trigraphs are a bit different -- in GNU mode, we don't support them by default (so GNU mode is not a superset of C)

(Well, that's really the point of GNU mode in general, as distinct from GNU extensions -- it's all the things that aren't conforming extensions. GNU C is a dialect of C, not a superset of C.)

and in standard C mode, we support them by default but tell the user "you've gone ahead and used a trigraph" which is a bit like telling the user "you used an addition operator" (accurate yet basically useless).

I think there's a big difference between those two cases: there are no reasons to intentionally use trigraphs except to confuse the reader or to avoid encoding issues. Given that the only encoding we support is UTF-8, there is no encoding problem for trigraphs to solve. So if we see a trigraph, either the programmer is trying to be confusing or they made a mistake, and either way that warrants a warning.

There are non-UTF-8 cases where there are genuine encoding issues that trigraphs affect. For example (second-hand info from talking to folks from IBM, any errors my own): there are multiple EBCDIC code pages, and the encoding of '#' is not consistent across them, but a single compilation will often enter (via #include) files with different encodings. There is a convention to use a ??=pragma to specify the encoding for each such file. If we were to add support for EBCDIC encodings, I would expect whatever flag enables such support would also recognize at least the ??= trigraph (at least at the start of the file) as part of that encoding mechanism.

I find the fact that Clang accepts non-standard keywords by default to be substantially more objectionable. In any case, I think our goal should be to change our default -std= mode from gnu to c. Does that seem reasonable?

I didn't dare to suggest that as an option because I figured it would be a non-starter, but if you think we can convince the community to go this route, I'm all for it. I think that having the compiler default to the standards-conformant mode is a better approach long-term than continuing to have developers silently introduce extensions to their code without noticing.

Do you envision the switch of default standard modes to also encompass whether we want Clang to pretend to be GCC by default as well (for things like _GNU_SOURCE or __GNUC__)? It sort of seems odd that if we don't enable GNU extensions we still enable macros like __GNUC__.

It's important to distinguish between GNU mode (the GNU C and GNU C++ languages) and GNU extensions here. GNU C and GNU C++ are languages much like ISO C and ISO C++, but with some differences; valid programs in GNU C aren't necessarily valid programs in ISO C and vice versa, but there's a very high degree of overlap. (For example, GNU C89 has 'typeof', 'asm', 'inline' as keywords, whereas ISO C89 has trigraphs.) That's what -std=cXY versus -std=gnuXY controls.

GNU extensions, on the other hand, are a set of conforming extensions to C-family languages, that occupy syntactic or semantic space that is either reserved or unused by the underlying language. (For example, attributes, statement expressions, typeof, constant folding of array bounds.) That's what __GNUC__ indicates the presence of.

So, changing the default -std= mode shouldn't change whether we define __GNUC__. (Another way to reach the same conclusion: GCC defines __GNUC__ in -std=cXY modes.)

_GNU_SOURCE and __STRICT_ANSI__ add more complication:

So changing the default from -std=gnu to -std=c would have no effect on the standard library interface made available in C++ (it would remain wrong, due to _GNU_SOURCE). But it would disable all standard library extensions in C. (Notably, those extensions are generally non-conforming.)

AaronBallman commented 3 years ago

Well, the implementation-defined portion of phase 1 of translation permits us to map trigraphs to alternative character sequences (this is the same "we don't need trigraphs to be in the standard to accept them" argument that was made to justify their removal from C++, but in reverse). Due to encoding issues, conformance of source files is not meaningful until after phase 1, and trigraphs are at least in some sense an encoding issue.

I agree with your assessment in terms of how to read the standard to justify the behavior, but it seems wrong to me that we diagnose a standard construct by default when that construct isn't particularly dangerous. For instance, diagnosing uses of gets() seems very reasonable to me even in C89 mode because gets() is broken. But trigraphs are a bit different -- in GNU mode, we don't support them by default (so GNU mode is not a superset of C) and in standard C mode, we support them by default but tell the user "you've gone ahead and used a trigraph" which is a bit like telling the user "you used an addition operator" (accurate yet basically useless).

Should we reevaluate whether we want to diagnose use of trigraphs in C mode by default? I feel like this should be an opt-in diagnostic unless WG14 decides to match WG21 and remove support for them (which would surprise me). I suspect that most accidental uses of trigraphs appear in comments where the only problematic trigraph is ??/, which seems unlikely to be an accidental trigraph, or in a string literal (which I admit could be a surprise for "??!" or "??)", but seems rather unlikely to be a common issue given that you need the literal to contain adjacent question marks).
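For readers unfamiliar with why ??/ is the problematic trigraph in comments, here is a minimal sketch with a hypothetical function name. It assumes C99 or later, since it needs `//` comments. When trigraphs are enabled, the sequence at the end of the comment becomes a backslash, the backslash-newline splices the next line into the comment, and the assignment disappears.

```c
#include <assert.h>

/* Returns 1 if the statement after the comment was compiled as code,
   0 if it was swallowed by the comment. */
int line_after_comment_runs(void) {
    int value = 0;
    // Is the next statement still code??/
    value = 1;
    assert(value == 0 || value == 1);
    return value;
}
```

With trigraphs disabled the function is expected to return 1, and with them enabled 0. Compilers commonly warn about this line in either mode, which is consistent with the point above that it is unlikely to go unnoticed.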

I find the fact that Clang accepts non-standard keywords by default to be substantially more objectionable. In any case, I think our goal should be to change our default -std= mode from gnu to c. Does that seem reasonable?

I didn't dare to suggest that as an option because I figured it would be a non-starter, but if you think we can convince the community to go this route, I'm all for it. I think that having the compiler default to the standards-conformant mode is a better approach long-term than continuing to have developers silently introduce extensions to their code without noticing.

Do you envision the switch of default standard modes to also encompass whether we want Clang to pretend to be GCC by default as well (for things like _GNU_SOURCE or __GNUC__)? It sort of seems odd that if we don't enable GNU extensions we still enable macros like __GNUC__.

ec04fc15-fa35-46f2-80e1-5d271f2ef708 commented 3 years ago

I don't think we should invent a new family of -std= modes to use by default. And I don't think it would be reasonable to change what -std=gnu* means. Instead, I think we should consider changing our default from -std=gnu* to -std=c*.

I've previously proposed changing this, but unfortunately there were suggestions that some targets have / had system headers that rely on non-conforming GNU extensions (particularly the GNU C keywords that aren't part of ISO C). It would be useful to know exactly which targets those are and how close they are to not relying on such extensions, and if we can encourage them to move to more-standard constructs.

However, when no -std= option is passed on the command line, we silently default to gnu17 mode which then disables this standard C feature and that seems less reasonable to me because this is a case where we're removing support for a feature rather than adding a conforming extension.

Well, the implementation-defined portion of phase 1 of translation permits us to map trigraphs to alternative character sequences (this is the same "we don't need trigraphs to be in the standard to accept them" argument that was made to justify their removal from C++, but in reverse). Due to encoding issues, conformance of source files is not meaningful until after phase 1, and trigraphs are at least in some sense an encoding issue.

I find the fact that Clang accepts non-standard keywords by default to be substantially more objectionable. In any case, I think our goal should be to change our default -std= mode from gnu to c. Does that seem reasonable?

llvmbot commented 1 week ago

@llvm/issue-subscribers-clang-frontend

Author: Aaron Ballman (AaronBallman)

| | |
| --- | --- |
| Bugzilla Link | [47645](https://llvm.org/bz47645) |
| Version | trunk |
| OS | Windows NT |
| CC | @DougGregor,@hubert-reinterpretcast,@zygoloid |

## Extended Description

Clang's trigraph support is disabled by default when the language standard is set to GNU C mode. Not supporting trigraphs is perhaps reasonable because the user is specifying that they want something other than standard C.

However, when *no* -std= option is passed on the command line, we silently default to gnu17 mode which then disables this standard C feature and that seems less reasonable to me because this is a case where we're *removing* support for a feature rather than adding a conforming extension. The user is saying "here's my C file, please compile it" and I think we should be able to compile conforming C in that case (for some C standard version, but all of them currently support trigraphs).

Confounding matters somewhat, we define __STDC__ to 1 in GNU mode despite trigraph support in C being mandatory.

Should trigraphs remain enabled when no -std is present (even though we would still default to gnu17) so that users can compile conforming C code? Should __STDC__ be paying attention to whether trigraph support is enabled or not?