sg16-unicode / sg16

SG16 overview and general information

UTF-8 -- Can we just move forward please? #52

Closed DBJDBJ closed 4 years ago

DBJDBJ commented 4 years ago

The C++20 cut-off date is nearing, and the u8 implementation is not ready yet. I might be so bold as to think that one bottom-line decision can be made to sign off on the smooth delivery of UTF-8, aka u8, for both WG14 and WG21, in time for the C++20 cut-off date.

Please just decide on the following, and then ask for it to be implemented.

    printf("%s8", u8"ひらがな"); // char8_t *

    printf("%c8", u8'ひ'); // char8_t

After that happens, everything else is smooth and logical. It seems obvious that a full and simple decision on char8_t, and its implementation, is the top priority right now (Nov 2019).


Yes, I am that troublemaker. :)

tahonermann commented 4 years ago

Per P1000, the C++20 feature cut off date has come and gone. We are in bug fix mode now and the issue raised is not a defect in the standard.

WG14 has not adopted char8_t but re-proposing it is an active project on my todo list (as is proposing a number of other SG16 related proposals adopted in C++20). The next WG14 standard is not expected for at least three years.

I don't agree that the suggested change is obvious or even desirable. "%s8" and "%c8" have well-defined meanings today, so this would be a breaking change. It also isn't clear to me what behavior is being proposed. Should printf just pass the u8 inputs straight through? Or transcode them to either the execution character set or the locale-dependent character encoding (which one and why)?
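To illustrate the breaking-change concern: under today's rules, "%s8" is the %s conversion followed by an ordinary character '8', and likewise for "%c8". The following is a minimal sketch of that existing behavior; the helper names `format_s8` and `format_c8` are hypothetical and only serve to capture the output via `std::snprintf`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats one string argument with "%s8",
// exactly as printf interprets that format string today.
std::string format_s8(const char* text) {
    char buffer[64];
    // "%s8" is the %s conversion followed by the literal character '8'.
    std::snprintf(buffer, sizeof buffer, "%s8", text);
    return buffer;
}

// Hypothetical helper: the same idea for "%c8".
std::string format_c8(char c) {
    char buffer[8];
    // "%c8" is the %c conversion followed by the literal character '8'.
    std::snprintf(buffer, sizeof buffer, "%c8", c);
    return buffer;
}
```

With these helpers, `format_s8("abc")` yields `abc8` and `format_c8('x')` yields `x8`, so existing code that relies on this output would change meaning if "%s8" and "%c8" were redefined.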

This issue requires a paper that presents possible options and analysis of pros and cons of each. I encourage further discussion on either the SG16 Slack channel or SG16 mailing list with the goal being to produce a paper (either with or without a specific proposal attached).

In the meantime, I'm going to close this issue as non-actionable as written. I'm open to creation of a new issue that describes the problem to be solved in more detail (probably with separate issues for printf vs iostreams), and without consideration of ship target. But given that this is a feature request, I'd like for any proposed solutions to be submitted as papers following the process documented at https://isocpp.org/std/submit-a-proposal.

DBJDBJ commented 4 years ago

Tim, I am an IT Architect. The most valuable tool I have is POV (Point of View) of a user/customer/client.

(Imagine) we are on a project using C++20. Let's say in gaming or even flight systems. Thus we have no std::, no exceptions, no heap allocation and such.

But we are delivering server-side components. And "lo and behold" there are other servers with other components, developed on different timelines by other projects. Naturally, on the system level, JSON is used to exchange messages. And data.

How do we create, send and receive JSON messages from our C++20 components?

ThePhD commented 4 years ago

You use external libraries, like everyone else.

DBJDBJ commented 4 years ago

What is the point then of char8_t ?

ThePhD commented 4 years ago

To give us room to breathe to make things better for C++23, since we are out of time for the C++20 cycle for new features.

DBJDBJ commented 4 years ago

char8_t .. is that not a C++20 keyword?

ThePhD commented 4 years ago

Correct, it is.

That does not mean the rest of the library is ready to adapt to such a new keyword. This is why Tom has spent time reserving and clearing things so that in the future we do not have to fight with people relying on broken or "happens to work" practices in the Standard Library: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html

If you would like to help with C++23 features, SG16 holds regular teleconferences and discusses all things related to text processing and Unicode in the Standard. You're more than welcome to contribute ideas, implementation strategy, specification and more. WG14 is also interested in many text processing improvements, for which proposals are already out and being worked on. WG14 also has an open procedure for receiving papers: just e-mail the ISO C Convener. More information can be found here: http://www.open-std.org/jtc1/sc22/wg14/

C++20 is closed for changes. I don't know how else you'd like us to spell that for you in order for you to understand that.

tahonermann commented 4 years ago

> Tim, I am an IT Architect. The most valuable tool I have is POV (Point of View) of a user/customer/client.

Tom actually :) I agree with POV being a valuable tool. That is exactly why I've asked for your thoughts on those SO posts and to contribute your thoughts to the SG16 mailing list.

With regard to JSON, there are lots of choices available. As far as I know, a standard JSON interface has never been proposed to WG21. I'm not sure why you feel JSON is relevant for this topic though.

> What is the point then of char8_t ?

I see three primary benefits: 1) It enables type safety and allows use of overloading to distinguish between foo("text") and foo(u8"text"). It is currently too easy to accidentally mix UTF-8 and text encoded according to the execution character set or locale character encoding. 2) It allows us to reason about text encoding without concern for locale. 3) It provides a non-aliasing type that can be better optimized.

Effectively, it is foundational. There won't be much support for it in the C++20 standard library and that is unfortunate, but we will continue to build on it for C++23 and later. Having char8_t in C++20 will allow for experimentation outside of the standard. For example, you can use it to implement a UTF-8 only JSON interface (perhaps even one that would be suitable for standardization).

DBJDBJ commented 4 years ago

@ThePhD , I understand C++20 is "closed" .. I am just trying to find out what is going to be implemented, in particular about utf8. Also, I am not interested in the std:: lib. Just the core language. It seems, I have to spell that. NP.

@tahonermann , sorry yes it is Tom, not Tim of course :) I completely understand the foundation char8_t makes possible, in C++23. I am sorry for not articulating my questions better. That might be partially due to the lack of overview kind-of-information, not lack of papers. My main (re)search aid is thus wandbox ...

I have seen your email and I will join the Slack channel.