Limit permissible execution encodings to match existing practice

WPMGPRoSToTeMa commented 5 years ago

I know of only one encoding that is not compatible with ASCII: EBCDIC. But is EBCDIC relevant today? Do you know any other encoding that is not compatible with ASCII?

cubbimew commented 5 years ago

What would a definition of ASCII do that C0 controls and Basic Latin (U+0000..U+007F) doesn't?

> Do you know any other encoding that is not compatible with ASCII?

I know a few, but they are as obsolete as ISO-646.

WPMGPRoSToTeMa commented 5 years ago

> What would a definition of ASCII do that C0 controls and Basic Latin (U+0000..U+007F) doesn't?

I think so.

tahonermann commented 5 years ago

> But is EBCDIC relevant today?

It is relevant in the industry, yes. z/OS maintains a significant commercial presence. Market share information can be difficult to find, but my understanding is that 80+% of Fortune 500 companies use z/OS and that 80+% of credit card transactions are processed by z/OS systems.

The question of whether EBCDIC is relevant for C++ going forward is more difficult to answer.

There are two C++ compilers available for z/OS today. IBM continues to provide xlC for z/OS (http://www-01.ibm.com/support/docview.wss?uid=swg27036892) and also recently started providing an LLVM-based C++11 compiler (njsc++) along with their Node.js offering (https://www.ibm.com/support/knowledgecenter/en/SSTRRS_6.0.0/com.ibm.nodejs.zos.v6.doc/understand.htm). Additionally, IBM now provides Swift for z/OS (https://developer.ibm.com/mainframe/products/ibm-toolkit-swift-z-os). Since Swift is based on LLVM and Clang, that implies that IBM has done most of the work to enable Clang on z/OS, and a mailing list post implies such progress (https://lists.swift.org/pipermail/swift-dev/Week-of-Mon-20170508/004572.html). Should we expect to see a C++17 or C++20 compiler for z/OS emerge? I don't know, but I think we can't preclude the possibility.

> Do you know any other encoding that is not compatible with ASCII?

Shift JIS (and its variants) are not ASCII compatible. In particular, Shift JIS as defined by JIS X 0208:1997 remaps 0x5C and 0x7E from \ (U+005C REVERSE SOLIDUS) and ~ (U+007E TILDE) in ASCII to ¥ (U+00A5 YEN SIGN) and ‾ (U+203E OVERLINE).
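To make the remapping concrete, here is a minimal sketch that decodes a single byte both ways. The function names are hypothetical, and the Shift JIS function only models the two differences named above for bytes in the 0x00..0x7F range; it does not attempt lead bytes, half-width katakana, or any variant-specific behavior.

```cpp
// Hypothetical helpers for illustration; not part of any library.

// U+FFFD REPLACEMENT CHARACTER, returned for bytes this sketch does not model.
constexpr char32_t kNotModeled = 0xFFFD;

// Decode one byte as ASCII: the code point equals the byte value.
constexpr char32_t decode_ascii(unsigned char byte) {
    return byte < 0x80 ? static_cast<char32_t>(byte) : kNotModeled;
}

// Decode one byte under the single-byte range of Shift JIS as described
// above: identical to ASCII except at 0x5C and 0x7E.
constexpr char32_t decode_shift_jis_single(unsigned char byte) {
    if (byte >= 0x80) {
        return kNotModeled;  // multi-byte and katakana ranges are out of scope
    }
    switch (byte) {
        case 0x5C: return 0x00A5;  // YEN SIGN instead of REVERSE SOLIDUS
        case 0x7E: return 0x203E;  // OVERLINE instead of TILDE
        default:   return static_cast<char32_t>(byte);
    }
}
```

The same byte 0x5C that a compiler emits for a backslash in a path literal therefore reads as a yen sign under this interpretation, even though no byte in the string changed.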

jfbastien commented 5 years ago

It would be nice to propose restricting what char can be. A paper could propose alternatives:

1. allowing only ASCII
2. allowing only ASCII and EBCDIC

We then discuss it in WG21 and take votes. Make sure IBM is on board, and record the votes. Even option 2 would be neat to get to, IMO.

cubbimew commented 5 years ago

That sounds like a proposal to kill off locales.

tahonermann commented 5 years ago

> It would be nice to propose restricting what char can be.

We could add restrictions on permissible execution encodings, but I'm not sure that doing so would accomplish anything in practice. I'm not too worried about any new implementations that use a non-ASCII (or non-UTF-8 really) execution encoding suddenly becoming popular :)

> That sounds like a proposal to kill off locales.

Not necessarily. There is a distinction between the execution encoding that is determined at run-time (via the Windows active code page setting or the POSIX LANG family of environment variables) and the presumed execution encoding that is known at compile time and used for string and character literals. The presumed execution encoding already places limits on what (run-time) execution encodings can be used without introducing problems like those that happen on Windows when the active code page is Shift JIS (due to the ASCII incompatibilities mentioned earlier). Basically, the presumed execution encoding must correspond to an encoding that is compatible with the set of supported (run-time) execution encodings. Limiting the presumed execution encoding to ASCII would not prohibit use of UTF-8, ISO-8859-1, or other ASCII-derived encodings as the execution encoding selected at run-time.
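As a rough illustration of what "known at compile time" means here, the following sketch spot-checks the values the compiler assigns to a handful of character literals. The function name is hypothetical and the sample of characters is arbitrary; it only inspects the presumed (compile-time) encoding and says nothing about the locale or code page selected at run-time.

```cpp
// Hypothetical helper: true when a sample of ordinary character literals
// has the values ASCII assigns to those characters. On an EBCDIC
// implementation this would be false (for example, 'A' is 0xC1 there).
constexpr bool literal_encoding_looks_like_ascii() {
    return 'A' == 0x41 && 'Z' == 0x5A &&
           'a' == 0x61 && 'z' == 0x7A &&
           '0' == 0x30 && '9' == 0x39 &&
           ' ' == 0x20 && '\n' == 0x0A &&
           '\\' == 0x5C && '~' == 0x7E;
}
```

A project that assumes ASCII-compatible literals could put `static_assert(literal_encoding_looks_like_ascii(), "...")` in a common header. Note that this cannot detect the Shift JIS situation described above, because the literal values are unchanged there; only their run-time interpretation differs.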

cubbimew commented 5 years ago

There is one interesting benefit of requiring ASCII as the encoding of char constants (to be specific about which meaning of char this concerns): it would restore C's original requirement that 'a' == L'a' for every member of the basic character set. That requirement was dropped in 2002-2004 specifically to support EBCDIC systems; see WG14 DR 279 and WG14 DR 321, which introduced __STDC_MB_MIGHT_NEQ_WC__ to mark such systems.
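A small sketch of that property, assuming nothing beyond the macro named above: it compares a sample of ordinary and wide character literals and records whether the implementation defines __STDC_MB_MIGHT_NEQ_WC__. The identifiers are hypothetical and the sample is not the full basic character set.

```cpp
// Hypothetical helper: true when each sampled basic character has the same
// value as an ordinary literal and as a wide literal.
constexpr bool basic_narrow_equals_wide() {
    return 'a' == L'a' && 'z' == L'z' &&
           'A' == L'A' && 'Z' == L'Z' &&
           '0' == L'0' && '9' == L'9' &&
           ' ' == L' ' && '_' == L'_';
}

// Records whether the implementation announces that the equality above
// is not guaranteed.
#ifdef __STDC_MB_MIGHT_NEQ_WC__
constexpr bool mb_might_neq_wc = true;
#else
constexpr bool mb_might_neq_wc = false;
#endif
```

On an implementation that does not define the macro, `basic_narrow_equals_wide()` should hold. Some implementations define the macro even though the sampled values match, so the macro alone does not prove a mismatch.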

jfbastien commented 5 years ago

> > It would be nice to propose restricting what char can be.

> We could add restrictions on permissible execution encodings, but I'm not sure that doing so would accomplish anything in practice. I'm not too worried about any new implementations that use a non-ASCII (or non-UTF-8 really) execution encoding suddenly becoming popular :)

Agreed that new ones won't come along, but I'd make the same argument as for defining two's complement: it removes useless implementation leeway, and has a few side-benefits.

WPMGPRoSToTeMa commented 5 years ago

> Should we expect to see a C++17 or C++20 compiler for z/OS emerge? I don't know, but I think we can't preclude the possibility.

I can't understand why the committee removed trigraphs (if we still support EBCDIC).

tahonermann commented 5 years ago

> Agreed that new ones won't come along, but I'd make the same argument as for defining two's complement: it removes useless implementation leeway, and has a few side-benefits.

I agree if we can identify some useful side-benefits.

One benefit would presumably be that the basic source character set would be extended to include additional characters common to all allowed execution encodings. This might include characters such as $ (U+0024 DOLLAR SIGN), @ (U+0040 COMMERCIAL AT), and ` (U+0060 GRAVE ACCENT). (I'm not sure if all EBCDIC code pages support these characters. At least some EBCDIC code pages assign them different code point values.) It would be nice to be able to write email addresses in standard-compliant portable C++ code! Of course, we could extend the basic source character set without restricting the permissible set of execution encodings.

Any other suggested benefits?

> I can't understand why the committee removed trigraphs (if we still support EBCDIC).

Trigraphs were added to work around the absence of keys for some characters on keyboards, not for character set limitations. The inability to type a character on modern keyboards (or via customizable key mapping software) is presumably a rare problem these days. Trigraphs were problematic because programmers accidentally wrote them quite often in string literals (note that digraphs don't have this problem because they represent tokens, not characters).
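To show how easily a trigraph could slip into a string literal, here is a sketch that models the nine trigraph sequences and counts how many occur in a string. Both function names are hypothetical, and this only imitates the old replacement rule; it is not how any compiler implements it. The code avoids spelling a trigraph in its own source so that it reads the same whether or not the compiler has trigraphs enabled.

```cpp
// Hypothetical helper: given the third character of a sequence that starts
// with two question marks, return the character the trigraph stood for,
// or '\0' if the sequence is not a trigraph.
constexpr char trigraph_replacement(char third) {
    switch (third) {
        case '=':  return '#';
        case '(':  return '[';
        case '/':  return '\\';
        case ')':  return ']';
        case '\'': return '^';
        case '<':  return '{';
        case '!':  return '|';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

// Hypothetical helper: count trigraph sequences in a null-terminated
// string, scanning left to right and skipping past each match.
constexpr int count_trigraphs(const char* s) {
    int count = 0;
    while (s[0] != '\0') {
        if (s[0] == '?' && s[1] == '?' && trigraph_replacement(s[2]) != '\0') {
            ++count;
            s += 3;
        } else {
            ++s;
        }
    }
    return count;
}
```

For example, `count_trigraphs("Really?\?!")` finds one sequence (the `\?` escape keeps the example itself trigraph-free): an innocent-looking "Really" followed by two question marks and an exclamation mark would have ended in a vertical bar under the old rule.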

jfbastien commented 5 years ago

> I can't understand why the committee removed trigraphs (if we still support EBCDIC).

Because trigraph replacement is a preprocessing step for source code and can therefore be removed, as explained in http://wg21.link/N3981, whereas EBCDIC is a runtime matter that cannot be removed if it's still relevant.

WPMGPRoSToTeMa commented 5 years ago

@tahonermann @jfbastien thanks for the clarification.

sg16-unicode / sg16

Limit permissible execution encodings to match existing practice #37