A portable mechanism to specify source file encoding

tahonermann commented 3 years ago

The set of encodings accepted for source files and the encoding actually used to interpret a source file are implementation-defined. From [lex.phases]p1.1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. ...

Many compilers support multiple encodings that can be used with source files. For example, gcc allows the source file encoding to be specified with the -finput-charset option and Visual C++ with the /source-charset option, but that encoding is then applied to all source files. Some compilers allow a per-file source encoding to be specified with a BOM or with an in-source syntax. For example, Visual C++ will recognize a source file with a UTF-8 BOM as being UTF-8 encoded and IBM's xlC compiler allows a source file to specify its encoding with a #pragma filetag directive. The latter is similar to the Python encoding declaration or the HTML encoding declaration.

The lack of a per-file mechanism to indicate source file encoding is an impediment to incremental adoption of UTF-8, since projects cannot rely on their public-facing header files being interpreted as UTF-8 encoded.

P2295 proposes requiring implementations to support UTF-8 encoded source, but leaves the mechanism for how the UTF-8 encoding is selected as implementation-defined. That presents the possibility of implementations choosing different, possibly even conflicting, mechanisms. Such conflicts exist today: while Visual C++ will honor a BOM, gcc will reject one unless it has already been directed to interpret such a source file as UTF-8 encoded.

The Unicode guidance for use of a BOM to determine file encoding is not clear. A paper was recently submitted to the Unicode Consortium to clarify that guidance.

Possibilities for portably specifying a per-file source encoding include:

A magic comment (like Python).
A pragma directive (like IBM xlC).
A BOM (like Visual C++).

ThePhD commented 3 years ago

We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.
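To make the "ignore a leading BOM" rule concrete, here is a minimal sketch of what an implementation might do before lexing. The names `utf8_bom` and `skip_utf8_bom` are hypothetical and do not come from P2295 or any compiler; the sketch assumes the file has already been identified as UTF-8.

```cpp
#include <string_view>

// The UTF-8 encoding of U+FEFF is the byte sequence EF BB BF.
constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";

// Hypothetical helper: returns the source text with a leading UTF-8 BOM
// removed, if one is present. Only the very start of the file is checked;
// a U+FEFF appearing later in the file is left untouched.
constexpr std::string_view skip_utf8_bom(std::string_view source) {
    if (source.substr(0, utf8_bom.size()) == utf8_bom) {
        source.remove_prefix(utf8_bom.size());
    }
    return source;
}
```

Under this sketch a file with and without the BOM produces the same character sequence for the later translation phases, which is the behavior being suggested.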

After that, between a pragma directive versus a magic comment, I would infinitely prefer the pragma. Such a pragma should be scoped to the implementation-defined concept of a file, lest it become a stateful entity that can spill over. We can mandate that such a pragma always supports #pragma encoding("UTF-8") as an input, and other support is implementation-defined.
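The file scoping described above could be modeled roughly as follows. This is only a sketch under assumptions: `EncodingScopes` and its member functions are hypothetical names, and the idea that every file starts from the implementation's default encoding (rather than inheriting from its includer) is one possible reading of "scoped to the file".

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a file-scoped encoding pragma. Each file gets its
// own entry, so a pragma in a header cannot spill over into its includer.
class EncodingScopes {
public:
    explicit EncodingScopes(std::string default_encoding)
        : default_(std::move(default_encoding)) {}

    // Called when a source file or header starts being processed.
    // Assumption: every file starts from the default encoding.
    void enter_file() { stack_.push_back(default_); }

    // Called for a directive such as #pragma encoding("UTF-8").
    // Affects only the file currently being processed.
    void apply_pragma(std::string encoding) {
        assert(!stack_.empty() && "pragma seen outside of any file");
        stack_.back() = std::move(encoding);
    }

    // Called at the end of a file; the includer's encoding is restored.
    void leave_file() {
        assert(!stack_.empty() && "leave_file without enter_file");
        stack_.pop_back();
    }

    const std::string& current() const {
        return stack_.empty() ? default_ : stack_.back();
    }

private:
    std::string default_;
    std::vector<std::string> stack_;
};
```

With this model, a header that declares UTF-8 is read as UTF-8, and the file that included it continues with whatever encoding it had before the #include.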

tahonermann commented 3 years ago

We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.

Agreed.

After that, between a pragma directive versus a magic comment, I would infinitely prefer the pragma. ...

That was my initial inclination as well, but I've been leaning towards the magic comment approach more lately. The benefits are:

Adopting the Python magic comment syntax would result in a cross-language solution that doesn't impose additional language-specific encoding concerns on tools and editors.
A comment-based approach won't trigger warnings due to unrecognised pragma directives in existing compilers. Such directives are problematic for build systems that elevate warnings to errors.
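For illustration, a Python-style magic comment could be recognized in a C++ file roughly like this. It is a simplified sketch, not a proposed specification: the function names are hypothetical, the use of `//` in place of Python's `#` is an assumption, and only the first occurrence of "coding" on each of the first two lines is considered (Python's own rule is a regular expression that is somewhat more permissive).

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <string_view>

// Characters allowed in an encoding name, following Python's declaration rule.
inline bool is_encoding_name_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 ||
           c == '-' || c == '_' || c == '.';
}

// Hypothetical helper: looks for "coding:" or "coding=" inside a // comment
// on the first or second line, e.g.  // -*- coding: utf-8 -*-
std::optional<std::string> magic_comment_encoding(std::string_view source) {
    for (int line_no = 0; line_no < 2 && !source.empty(); ++line_no) {
        // Split off the next line.
        const auto eol = source.find('\n');
        std::string_view line = source.substr(0, eol);
        source = (eol == std::string_view::npos) ? std::string_view{}
                                                 : source.substr(eol + 1);

        // Only text after a // comment marker is examined.
        const auto comment = line.find("//");
        if (comment == std::string_view::npos) continue;
        line.remove_prefix(comment + 2);

        const auto key = line.find("coding");
        if (key == std::string_view::npos) continue;
        line.remove_prefix(key + 6);  // 6 == length of "coding"
        if (line.empty() || (line.front() != ':' && line.front() != '=')) continue;
        line.remove_prefix(1);

        // Skip blanks between the separator and the encoding name.
        while (!line.empty() && (line.front() == ' ' || line.front() == '\t')) {
            line.remove_prefix(1);
        }

        std::size_t length = 0;
        while (length < line.size() && is_encoding_name_char(line[length])) ++length;
        if (length > 0) return std::string(line.substr(0, length));
    }
    return std::nullopt;
}
```

Because the declaration lives in an ordinary comment, compilers that do not know about it simply skip it, which is the warning-free property mentioned above.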

These probably shouldn't be SG16 concerns though. In the paper I've been threatening to finish for ~forever now, my plan has been to propose options for EWG to consider and make a decision on.

cor3ntin commented 3 years ago

We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.

P2295 does that.

sg16-unicode / sg16

A portable mechanism to specify source file encoding #71