sg16-unicode / sg16

SG16 overview and general information

`std::to_chars`/`std::from_chars` overloads for `char8_t` #38

Open WPMGPRoSToTeMa opened 5 years ago

WPMGPRoSToTeMa commented 5 years ago

std::to_chars/std::from_chars are mostly intended to be used in parsers, such as a JSON parser. JSON requires at least ASCII encoding, especially when it is transmitted over the Internet. So you have to use char8_t instead of char, since the execution encoding can be ASCII-incompatible. Of course, you can convert from the char encoding to ASCII (and vice versa) before/after using std::to_chars/std::from_chars, or you can write your own parse functions for char8_t, but I think it would be better to have std::to_chars/std::from_chars overloads for char8_t.
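
A minimal sketch of the "convert before/after" workaround described above, not a proposed interface. The names to_code_units and to_code_units_result are hypothetical, and the code assumes the ordinary literal encoding is ASCII-compatible, so each char produced by std::to_chars can be widened as-is. On an ASCII-incompatible platform a real transcoding step would be needed instead.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <system_error>

// Hypothetical result type, modeled on std::to_chars_result.
template <class CodeUnit>
struct to_code_units_result {
    CodeUnit* ptr;
    std::errc ec;
};

// Hypothetical helper, not a standard function: formats `value` with the
// existing char-based std::to_chars, then copies each char into a CodeUnit
// buffer (char8_t in C++20, or char16_t / char32_t).
// Assumption: the ordinary literal encoding is ASCII-compatible.
// Precondition (as for std::to_chars): 2 <= base <= 36.
template <class CodeUnit>
to_code_units_result<CodeUnit> to_code_units(CodeUnit* first, CodeUnit* last,
                                             long long value, int base = 10) {
    assert(first <= last);
    char tmp[66];  // a 64-bit value in base 2 plus a sign fits in 65 chars
    const std::to_chars_result r = std::to_chars(tmp, tmp + sizeof tmp, value, base);
    if (r.ec != std::errc{}) return {last, r.ec};
    const std::ptrdiff_t n = r.ptr - tmp;
    if (n > last - first) return {last, std::errc::value_too_large};
    for (std::ptrdiff_t i = 0; i < n; ++i)
        first[i] = static_cast<CodeUnit>(static_cast<unsigned char>(tmp[i]));
    return {first + n, std::errc{}};
}
```

With a C++20 compiler this could be called as to_code_units(buf, buf + 16, 42) for a char8_t buf[16]. The extra copy is the cost that dedicated overloads would avoid.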

tahonermann commented 5 years ago

Thanks for filing the issue. I realize now I misunderstood what you were suggesting in issue #5. I previously thought you were requesting conversions from char8_t (code unit) values (in retrospect, I have no idea why I thought that).

There are many functions in the standard that we may want to provide overloads for in support of UTF-8 (and UTF-16 and UTF-32). So far, the committee has been quite conservative in adding overloads for char16_t and char32_t. What about printf, snprintf, cout, the proposed std::format (P0645), etc...? We have a lot of options (UTF all the things!, UTF-8 all the things!, UTF some subset of the things!) and I'm not sure how much patience the committee will have for adding lots of overloads.

To some extent, std::to_chars and std::from_chars are easy cases since they only deal with characters from the basic source character set and transcoding inputs/outputs to/from them therefore doesn't lose information. That isn't the case for other functions.
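
As a rough illustration of why the parsing direction is lossless, here is a sketch that narrows only the leading run of ASCII code units and hands it to the char-based std::from_chars. The names from_code_units and from_code_units_result are hypothetical, and the code again assumes an ASCII-compatible ordinary literal encoding. Any code unit outside the ASCII range cannot be part of a number, so stopping there loses nothing.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical result type, modeled on std::from_chars_result.
template <class CodeUnit>
struct from_code_units_result {
    const CodeUnit* ptr;
    std::errc ec;
};

// Hypothetical helper, not a standard function: parses an int from a range of
// code units (char8_t in C++20, or char16_t / char32_t).
// Assumption: the ordinary literal encoding is ASCII-compatible.
template <class CodeUnit>
from_code_units_result<CodeUnit> from_code_units(const CodeUnit* first,
                                                 const CodeUnit* last,
                                                 int& value, int base = 10) {
    assert(first <= last);
    // Narrow only the leading ASCII run; digits, signs and letters used in
    // bases up to 36 all live there.
    std::string tmp;
    for (const CodeUnit* p = first;
         p != last && static_cast<unsigned long>(*p) < 0x80; ++p)
        tmp.push_back(static_cast<char>(*p));
    const char* b = tmp.data();
    const std::from_chars_result r = std::from_chars(b, b + tmp.size(), value, base);
    // Map the char-based end pointer back onto the original range.
    return {first + (r.ptr - b), r.ec};
}
```

The temporary std::string is exactly the overhead a native overload would remove, which is the performance criterion mentioned below.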

I think more discussion is needed. I also think we should first focus on getting general transcoding interfaces in place before we start proposing additional overloads. That will help us identify criteria useful to determine when transcoding suffices vs when overloads are necessary (for performance or preservation of information reasons).

steve-downey commented 5 years ago

JSON requires UTF-8 at least, not just ASCII, although since the 7 bit part can look the same, it's hard to tell the difference.

"JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32."

cor3ntin commented 5 years ago

what should

const auto n = u8"Ⅷ";
int result;
std::from_chars(n, n + size(n), result);

do ?

tahonermann commented 5 years ago

what should ... do ?

Well, according to [charconv.from.chars]p3, it should do what std::strtol would do! (return an error :p)
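
For comparison, this is what the existing char-based overload does when handed the UTF-8 bytes of "Ⅷ" (U+2167, encoded as E2 85 A7). The function name roman_numeral_is_rejected is only illustrative. Since no leading characters match the expected pattern, std::from_chars reports std::errc::invalid_argument, returns ptr equal to first, and leaves the output value untouched.

```cpp
#include <cassert>
#include <charconv>
#include <system_error>

// Returns true if the char-based std::from_chars rejects the UTF-8 bytes of
// U+2167 ROMAN NUMERAL EIGHT in the way described above.
inline bool roman_numeral_is_rejected() {
    const char n[] = "\xE2\x85\xA7";  // UTF-8 encoding of U+2167
    int result = -1;
    // sizeof n - 1 excludes the terminating null character.
    const std::from_chars_result r = std::from_chars(n, n + sizeof n - 1, result);
    return r.ec == std::errc::invalid_argument && r.ptr == n && result == -1;
}
```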

cor3ntin commented 5 years ago

That's u8 - the standard says nothing yet :p

tahonermann commented 5 years ago

Fair point. So the future question is, what does std::u8strtol do? :p

DBJDBJ commented 4 years ago

  1. std::format is not a proposal any more: it is adopted
  2. The WG21 vs JSON situation is not a good one; otherwise char8_t would have been fully implemented in C++11, which would have made standard JSON possible, natural and easy.
  3. Thus right now we can not do auto json = std::format( u8"{ \"hiragana\" : \"{ }\" }", u8"ひらがな" );

tahonermann commented 4 years ago

Thus right now we can not do auto json = std::format( u8"{ \"hiragana\" : \"{ }\" }", u8"ひらがな" );

The author of std::format intends to propose additional char8_t and friends overloads for C++23.

DBJDBJ commented 4 years ago

Good news, but I wonder how is he going to do that?

Thanks for watching ... I will keep quiet for a while now :)

tahonermann commented 4 years ago

Good news, but I wonder how is he going to do that?

I'm not sure what you mean. Nothing technical is stopping him from providing char8_t based overloads. And std::format produces a std::basic_string; presumably a char8_t overload would produce a std::u8string.

Thanks for watching ... I will keep quiet for a while now :)

No need to keep quiet (but participation via our Slack channel or mailing list will reach a larger audience; see links at https://github.com/sg16-unicode/sg16; and email me if you need a Slack invitation).

DBJDBJ commented 4 years ago

Thanks, Tom. Basically, I have no idea about what is in C++20 and what is not in regarding char8_t. Where I can read about it? N4835?

Also <uchar.h> <cuchar> confuses me.

Then the project management is sometimes so simple it is confusing. Everything seems jumbled together. It seems there is no "nice to have" -- "useful" -- "essential", or some such categorization of things to be done?

For example there is a language and there is a library. The library depends on the language. And it seems the language has postponed the full set of decisions until 2023? Etc ... Ok going to Slack ...

Ixrec commented 4 years ago

I have no idea about what is in C++20 and what is not in regarding char8_t. Where I can read about it? N4835?

Unfortunately there's no official per-proposal status page online anywhere, and the unofficial posts that people put up after each standards meeting (e.g. https://www.reddit.com/r/cpp/comments/cfk9de/201907_cologne_iso_c_committee_trip_report_the/) are never totally comprehensive, though they're pretty good about the "need-to-know" big ticket items like std::format. And of course the tables in the README here should be thorough for unicode-related papers.

But as far as I know, the least bad general answer is to simply get the latest working draft of the standard (I assume https://en.cppreference.com/w/cpp/links 's link to it is kept very up to date) and ctrl+F for char8_t.

steve-downey commented 4 years ago

The draft sources are on GitHub, but the best rendering of them is at https://eel.is/c++draft/

However, the standard is not a tutorial. And sorting out meaningful change is difficult.

tahonermann commented 4 years ago

@DBJDBJ, the questions you are asking here are fine questions, but are not relevant to this github issue. Please ask them again either in Slack or on the SG16 mailing list and I'll be happy to respond there.

DBJDBJ commented 4 years ago

@Ixrec that's exactly what I am/was doing ... @steve-downey ...ditto :) @tahonermann -- please do mail me an invitation to the Slack channel -- to -- dbj at dbj dot org

dascandy commented 3 years ago

@tahonermann please send me a new invite; I do not have a pending one (on this account at least, which is the one it should be).

tahonermann commented 3 years ago

@dascandy, invite reissued. Let me know if you don't see it.

cor3ntin commented 3 years ago

Something I forgot to say during the telco I think: How Important is this feature once we have explicit char/char8_t conversion? At the very least, we should get that conversion first and then see what is needed :)

tahonermann commented 3 years ago

How Important is this feature once we have explicit char/char8_t conversion?

I don't think we want char/char8_t "conversion" as that would defeat TBAA. What we've discussed is the ability to perform an explicit alias barrier cast to an underlying type within a limited scope. I think we need such a feature, but I think it should mostly be an expert-only kind of feature and something relatively ugly in usage; something that might be used to implement char8_t-based std::from_chars() and std::to_chars() as opposed to something that programmers use to call the char-based variants.

At the very least, we should get that conversion first and then see what is needed :)

I think that is reasonable (s/conversion/alias barriers/, a la #67).

cor3ntin commented 3 years ago

You are right: s/conversion/explicit cast/