A portable mechanism to specify source file encoding #71

Open tahonermann opened 3 years ago

tahonermann commented 3 years ago

The set of encodings accepted for source files and the encoding actually used to interpret a source file are implementation-defined. From [lex.phases]p1.1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. ...

Many compilers support multiple encodings that can be used with source files. For example, gcc allows the source file encoding to be specified with the -finput-charset option and Visual C++ with the /source-charset option, but that encoding is then applied to all source files. Some compilers allow a per-file source encoding to be specified with a BOM or with an in-source syntax. For example, Visual C++ will recognize a source file with a UTF-8 BOM as being UTF-8 encoded and IBM's xlC compiler allows a source file to specify its encoding with a #pragma filetag directive. The latter is similar to the Python encoding declaration or the HTML encoding declaration.

The lack of a portable per-file mechanism to indicate source file encoding is an impediment to incremental adoption of UTF-8, since projects cannot rely on their public-facing header files being interpreted as UTF-8 encoded.

P2295 proposes requiring implementations to support UTF-8 encoded source, but leaves the mechanism for selecting the UTF-8 encoding implementation-defined. That presents the possibility of implementations choosing different, possibly even conflicting, mechanisms. Such conflicts exist today: Visual C++ will honor a BOM, while gcc will reject one unless it has already been directed to interpret the source file as UTF-8 encoded.

The Unicode guidance for use of a BOM to determine file encoding is not clear. A paper was recently submitted to the Unicode consortium to clarify that guidance.

Possibilities for portably specifying a per-file source encoding include:

ThePhD commented 3 years ago

We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.
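To make the "ignore a leading BOM" rule concrete, here is a minimal sketch of what it could amount to when reading a UTF-8 source file. The helper name `strip_utf8_bom` is hypothetical and is not taken from any compiler or from P2295; it only illustrates dropping the three-byte sequence EF BB BF when it appears at the very start of the input.

```cpp
#include <string_view>

// Hypothetical helper (illustration only): return the source text without a
// leading UTF-8 byte order mark. Input that does not start with the BOM is
// returned unchanged.
constexpr std::string_view strip_utf8_bom(std::string_view source) {
    // U+FEFF encoded as UTF-8 is the byte sequence EF BB BF.
    constexpr std::string_view bom = "\xEF\xBB\xBF";
    if (source.substr(0, bom.size()) == bom) {
        source.remove_prefix(bom.size());
    }
    return source;
}
```

Only a BOM at offset zero is dropped; the same bytes later in the file are left alone, which matches the "first character of a UTF-8 source file" wording above.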

After that, between a pragma directive versus a magic comment, I would infinitely prefer the pragma. Such a pragma should be scoped to the implementation-defined concept of a file, lest it become a stateful entity that can spill over. We can mandate that such a pragma always supports #pragma encoding("UTF-8") as an input, and other support is implementation-defined.
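As a rough sketch of how the suggested directive could be recognized on a per-file basis, the function below looks for `#pragma encoding("NAME")` on the first line of one file's text and returns NAME. Everything here is an assumption for illustration: the name `declared_encoding`, the first-line-only rule, and the exact spelling with no extra whitespace are not part of any proposal or existing compiler.

```cpp
#include <optional>
#include <string_view>

// Hypothetical sketch: extract NAME from a first line of the form
//   #pragma encoding("NAME")
// Returns std::nullopt when the first line is not such a directive.
// The returned view refers into `source`, so it is valid only while the
// underlying text is alive.
constexpr std::optional<std::string_view>
declared_encoding(std::string_view source) {
    constexpr std::string_view prefix = "#pragma encoding(\"";
    constexpr std::string_view suffix = "\")";

    // Look at the first line only (assumption: the directive must come first).
    std::string_view line = source.substr(0, source.find('\n'));
    if (line.substr(0, prefix.size()) != prefix) {
        return std::nullopt;
    }
    line.remove_prefix(prefix.size());

    const auto end = line.find(suffix);
    if (end == std::string_view::npos) {
        return std::nullopt;
    }
    return line.substr(0, end);
}
```

Because the result is computed from a single file's text and carries no global state, a caller would invoke it once per file, so the declared encoding cannot spill over into other files.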

tahonermann commented 3 years ago

We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.

Agreed.

After that, between a pragma directive versus a magic comment, I would infinitely prefer the pragma. ...

That was my initial inclination as well, but I've been leaning towards the magic comment approach more lately. The benefits are:

These probably shouldn't be SG16 concerns though. In the paper I've been threatening to finish for ~forever now, my plan has been to propose options for EWG to consider and make a decision on.

cor3ntin commented 3 years ago

We can always state that if a Byte Order Mark appears as the first character of a UTF-8 source file, it will be ignored. That's a good first start once we get UTF-8 files as a mandated part of support in the C++ abstract machine.

P2295 does that