Open tahonermann opened 3 years ago
We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.
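To make the "ignored" behavior concrete, here is a minimal sketch of dropping a leading UTF-8 BOM (the bytes EF BB BF) before any further processing. The helper name skip_utf8_bom is hypothetical and is not taken from any compiler or proposal; it assumes C++17 for std::string_view.

```cpp
#include <string_view>

// Hypothetical helper (not from any compiler or paper): returns the source
// text with a leading UTF-8 BOM removed; any other input is returned unchanged.
inline std::string_view skip_utf8_bom(std::string_view source) {
    // The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
    const std::string_view bom = "\xEF\xBB\xBF";
    if (source.substr(0, bom.size()) == bom) {
        source.remove_prefix(bom.size());
    }
    return source;
}
```

Only a BOM at the very start of the file is affected; the same byte sequence later in the file is left alone.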
After that, between a pragma directive versus a magic comment, I would infinitely prefer the pragma. Such a pragma should be scoped to the implementation-defined concept of a file, lest it become a stateful entity that can spill over. We can mandate that such a pragma always supports #pragma encoding("UTF-8") as an input, and other support is implementation-defined.
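As a rough illustration of how a tool might recognize such a directive, the sketch below checks a single line for the #pragma encoding("...") spelling suggested above and extracts the encoding name. The function name parse_encoding_pragma and the exact whitespace rules are assumptions made for this example; no existing compiler is claimed to implement this directive.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical scanner: returns the encoding named by a line of the form
//   #pragma encoding("NAME")
// or std::nullopt if the line is not such a directive.
// The accepted whitespace is an assumption made for this sketch.
inline std::optional<std::string> parse_encoding_pragma(const std::string& line) {
    static const std::regex directive(
        R"re(^\s*#\s*pragma\s+encoding\s*\(\s*"([^"]+)"\s*\)\s*$)re");
    std::smatch match;
    if (std::regex_match(line, match, directive)) {
        return match[1].str();  // the text between the quotes
    }
    return std::nullopt;
}
```

Because the function looks at one line at a time and keeps no state, a caller that applies it per file gets the file-scoped behavior asked for above. For example, parse_encoding_pragma("#pragma encoding(\"UTF-8\")") yields "UTF-8", while "#pragma once" yields no value.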
We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.
Agreed.
After that, between a pragma directive versus a magic comment, I would infinitely prefer the pragma. ...
That was my initial inclination as well, but I've been leaning towards the magic comment approach more lately. The benefits are:
pragma directives in existing compilers. Such directives are problematic for build systems that elevate warnings to errors. These probably shouldn't be SG16 concerns though. In the paper I've been threatening to finish for ~forever now, my plan has been to propose options for EWG to consider and make a decision on.
We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.
P2295 does that
The set of encodings accepted for source files and the encoding actually used to interpret a source file are implementation-defined. From [lex.phases]p1.1:
Many compilers support multiple encodings that can be used with source files. For example, gcc allows the source file encoding to be specified with the -finput-charset option and Visual C++ with the /source-charset option, but that encoding is then applied to all source files. Some compilers allow a per-file source encoding to be specified with a BOM or with an in-source syntax. For example, Visual C++ will recognize a source file with a UTF-8 BOM as being UTF-8 encoded and IBM's xlC compiler allows a source file to specify its encoding with a #pragma filetag directive. The latter is similar to the Python encoding declaration or the HTML encoding declaration.
The lack of a per-file mechanism to indicate source file encoding is an impediment to incremental adoption of UTF-8 since projects cannot rely on their public facing header files being interpreted as UTF-8 encoded.
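For comparison with the Python encoding declaration mentioned above, here is a sketch of what recognizing a comment-based declaration could look like for C++ source. The pattern is adapted from the one Python documents for its own declarations, with the leading # replaced by //. The comment syntax for C++ and the function name parse_magic_comment are assumptions made for illustration only, not something any compiler is known to accept.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical scanner for a Python-style encoding declaration written in a
// C++ line comment, e.g.  // -*- coding: utf-8 -*-
// Returns the declared encoding name, or std::nullopt if none is found.
inline std::optional<std::string> parse_magic_comment(const std::string& line) {
    // The line must start (after optional blanks) with a // comment that
    // contains "coding:" or "coding=" followed by an encoding name.
    static const std::regex declaration(
        R"re(^[ \t]*//.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+))re");
    std::smatch match;
    if (std::regex_search(line, match, declaration)) {
        return match[1].str();
    }
    return std::nullopt;
}
```

Python only honors such a declaration on the first or second line of a file; a C++ mechanism along these lines would presumably need a similar placement rule, which this sketch leaves to the caller.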
P2295 proposes requiring implementations to support UTF-8 encoded source, but leaves the mechanism for how the UTF-8 encoding is selected as implementation-defined. That presents the possibility of implementations choosing different, possibly even conflicting, mechanisms (such conflicts exist today: while Visual C++ will honor a BOM, gcc will reject one unless it has already been directed to interpret such a source file as UTF-8 encoded).
The Unicode guidance for use of a BOM to determine file encoding is not clear. A paper was recently submitted to the Unicode Consortium to clarify that guidance.
Possibilities for portably specifying a per-file source encoding include:
A pragma directive (like IBM xlC).