sg16-unicode / sg16

SG16 overview and general information


Support for UTF encodings in std::format() and std::print() #68

Open tahonermann opened 3 years ago

tahonermann commented 3 years ago

std::format() (in C++20) and std::print() (proposed in P2093) do not allow char8_t, char16_t, and char32_t based strings to be used for either the format string or for field arguments.

There are two distinct concerns:

1. If UTF strings are allowed as format strings, what conversions are performed on char and wchar_t based field arguments?
```
std::string s = ...;
std::format(u"{}", s);
```
2. If UTF strings are allowed as field arguments, what conversions are performed when the format string is char or wchar_t based?
```
std::u16string s = ...;
std::format("{}", s);
```
The answers to those questions may be dependent on one or both of:

- The literal encoding (execution character set) selected at compile-time (as is proposed in P2093).
- The locale dependent system encoding selected at run-time.
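
A minimal sketch of how the first of those two inputs can be observed at compile time. It assumes a compiler whose ordinary literal encoding can represent U+00A7; the function name `literal_encoding_is_utf8` is hypothetical and not part of any standard or proposed API. The probe checks whether that one character is stored as the two UTF-8 bytes 0xC2 0xA7.

```cpp
// Hypothetical helper: reports whether ordinary string literals appear to be
// encoded as UTF-8 in this translation unit.
constexpr bool literal_encoding_is_utf8() {
    // U+00A7 (SECTION SIGN) is 0xC2 0xA7 in UTF-8, but a single byte
    // (0xA7) in ISO 8859-1, so the array size alone already differs.
    constexpr unsigned char probe[] = "\u00A7";
    return sizeof(probe) == 3 && probe[0] == 0xC2 && probe[1] == 0xA7;
}
```

The run-time locale encoding has no comparable compile-time probe, which is why the two inputs can disagree.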

jensmaurer commented 3 years ago

On 18/03/2021 21.53, Tom Honermann wrote:

std::format() (in C++20) and std::print() (proposed in P2093 https://wg21.link/p2093) do not allow char8_t, char16_t, and char32_t based strings to be used for either the format string or for field arguments.

There are two distinct concerns:

If UTF strings are allowed as formatter strings, what conversions are performed on char and wchar_t based field arguments?

std::string s = ...; std::format(u"{}", s);

If UTF strings are allowed as field arguments, what conversions are performed when the format string is char or wchar_t based?

std::u16string s = ...; std::format("{}", s);

The answers to those questions may be dependent on one or both of:

The literal encoding (execution character set) selected at compile-time (as is proposed in P2093 https://wg21.link/p2093).

The locale dependent system encoding selected at run-time.

printf can take a wchar_t string, and wprintf can take a char string. In the first case, wcrtomb is used to convert, in the second case, mbrtowc is used to convert.

For case 1 above, this seems to suggest that "s" is assumed to be in the encoding that wcrtomb or mbrtowc would expect as input (presumably, the locale-dependent (multibyte) encoding for w/char strings), and the formatting produces a UTF-8 string as output.

For case 2 above, the result should be a w/char string, so "s" needs to be converted to the respective (runtime) encoding. It seems that the literal encoding is not relevant at all, and one can just hope that the encoding of the literal format string happens to just work when interpreted under the runtime encoding. (Limiting the format string to the basic character set will probably help, in practice. For example, ASCII -> UTF-8 works, as does ASCII -> ISO 8859-x, but ASCII -> EBCDIC doesn't.)
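
A minimal sketch of the wcrtomb-style conversion referred to above, assuming the wide string is converted one character at a time under the current C locale. The helper name `narrow_with_wcrtomb` and the choice to substitute '?' for unconvertible characters are illustrative assumptions, not something printf itself does.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>
#include <string>
#include <string_view>

// Hypothetical helper: converts a wide string to the multibyte encoding of
// the current C locale, one wide character at a time, using std::wcrtomb.
std::string narrow_with_wcrtomb(std::wstring_view in) {
    std::string out;
    std::mbstate_t state{};   // initial conversion state
    char buf[MB_LEN_MAX];     // large enough for any single character
    for (wchar_t wc : in) {
        const std::size_t n = std::wcrtomb(buf, wc, &state);
        if (n == static_cast<std::size_t>(-1)) {
            // Not representable in the locale's encoding: substitute and
            // reset the state, which is unspecified after an error.
            out += '?';
            state = std::mbstate_t{};
            continue;
        }
        out.append(buf, n);
    }
    return out;
}
```

The result depends on the locale in effect at the time of the call, which is the run-time dependency under discussion.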

Jens

tahonermann commented 3 years ago

For case 1 above, this seems to suggest that "s" is assumed to be in the encoding that wcrtomb or mbrtowc would expect as input (presumably, the locale-dependent (multibyte) encoding for w/char strings), and the formatting produces a UTF-8 string as output.

It may suggest that, and assumptions would have to be made. The conclusion that UTF-8 output is produced does not resonate with me. That certainly would not be the case on an EBCDIC based system.

It seems that the literal encoding is not relevant at all, and one can just hope that the encoding of the literal format string happens to just work when interpreted under the runtime encoding.

That hope is certainly required in many cases, but it may be reasonable to make decisions based on literal encoding. For example, if the literal encoding is UTF-8, it may be reasonable to only claim conformance if the application is run in a UTF-8 environment. In that case, locale dependencies can be avoided.

tahonermann commented 3 years ago

Here is a possible model that takes into account both the literal encoding and the locale dependent run-time encoding.

If the format string is UTF-based, then:
- char8_t, char16_t, and char32_t based field arguments are converted to the (UTF) encoding of the format string (not locale dependent).
- char based field arguments are converted as follows:
  - If the literal encoding is a UTF encoding, conversion is from that (UTF) encoding to the (UTF) encoding of the format string (not locale dependent).
  - Otherwise, conversion is as if by, for example, mbrtoc16() (locale dependent).
- wchar_t based field arguments are converted as follows:
  - If the wide literal encoding is a UTF encoding, conversion is from that (UTF) encoding to the (UTF) encoding of the format string (not locale dependent).
  - Otherwise, conversion is as if by, for example, wcrtoc16() (if such a conversion function existed; locale dependent).

Otherwise, if the format string is char based:
- If the literal encoding is a UTF encoding:
  - char8_t, char16_t, and char32_t based field arguments are converted to that (UTF) encoding (not locale dependent).
  - wchar_t based field arguments are converted as follows:
    - If the wide literal encoding is a UTF encoding, conversion is from that (UTF) encoding to the (UTF) encoding of the format string (not locale dependent).
    - Otherwise, conversion is as if by, for example, a char based wcrtoc8() (if such a conversion function existed; locale dependent).
- Otherwise:
  - char8_t, char16_t, and char32_t based field arguments are converted as if by, for example, c16rtomb() (locale dependent).
  - wchar_t based field arguments are converted as follows:
    - If the wide literal encoding is a UTF encoding, conversion is as if by, for example, a wchar_t based c16rtomb() (if such a conversion function existed; locale dependent).
    - Otherwise, conversion is as if by wcrtomb() (locale dependent).

Otherwise (the format string is wchar_t based):
- If the wide literal encoding is a UTF encoding:
  - char8_t, char16_t, and char32_t based field arguments are converted to that (UTF) encoding (not locale dependent).
  - char based field arguments are converted as follows:
    - If the literal encoding is a UTF encoding, conversion is from that (UTF) encoding to the (UTF) encoding of the format string (not locale dependent).
    - Otherwise, conversion is as if by, for example, a wchar_t based mbrtoc16() (if such a conversion function existed; locale dependent).
- Otherwise:
  - char8_t, char16_t, and char32_t based field arguments are converted as if by, for example, c16rtowc() (if such a conversion function existed; locale dependent).
  - char based field arguments are converted as follows:
    - If the literal encoding is a UTF encoding, conversion is as if by, for example, a char based c8rtowc() (if such a conversion function existed; locale dependent).
    - Otherwise, conversion is as if by mbrtowc() (locale dependent).

peter-b commented 3 years ago

I am no longer intending to pursue this direction.

zoran12 commented 11 months ago

As a C++ programmer, I would like to say that not having support for char8_t, char16_t, and char32_t in std::format is really bad, and honestly I have a hard time seeing why... If there are only these two distinct concerns, then the solution is simple: don't let a conversion happen at all if it's not possible. If the format text is char8_t, then all field arguments need to be char8_t too...

I think that's how it works for wide characters at the moment. I was playing with the std::format code, and this clearly says so (line 3594 of format.h): // not using the macro because we'd like to avoid the formatter<wchar_t, char> specialization template <> struct formatter<wchar_t, wchar_t> : _Formatter_base<wchar_t, wchar_t, _Basic_format_arg_type::_Char_type> {};

Creating a custom std::format function that supports all character types needs only a little code duplication at the moment (at least on Windows), and I got std::format working for all character types. The only thing I didn't test is chrono formatting: the #define used for chrono formatting is the one thing I left unchanged, as changing it would also require changes to the code where it's used.

In my opinion it's so simple to add support for these new character types, as all the support is already there. char8_t can use 100% of the code that's there for char, since when formatting char we assume char is UTF-8 encoded and not ASCII. That's clearly seen in this function: _NODISCARD constexpr _Decode_result _Decode_utf(const char* _First, const char* _Last, char32_t& _Val) noexcept

Support for char16_t is also there, as it's the same as wchar_t on Windows. They are 100% the same: when you create a new string with the char16_t type, VS reports it as if its text were created with the L"" prefix even though it was created with the u"" prefix...

Support for char32_t is already in the format code too, as there is already a _Decode_utf function for char32_t, and on Mac the wide character is 32 bits, so that also has to work (I assume the std::format code is the same for all platforms, with just #defines selecting different code paths depending on OS differences).

In my opinion, 99.99% of the usage of format code is on characters of the same type, and those are the most efficient cases. So please just enable that support in std::format; all the difficulty of conversion is better left to a new class like std::convert, and then let the user convert all text, if needed, before it's passed to the formatter. I mean, at the moment std::format does full UTF-8 decoding/encoding even for simple plain ASCII text, which is just silly I would say...
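
A minimal sketch of the "convert first, then format" approach, assuming the caller wants UTF-8 in a std::string. The helper `utf16_to_utf8` is hypothetical (there is no std::convert), and replacing unpaired surrogates with U+FFFD is one possible error policy among several.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: transcodes UTF-16 to UTF-8 without consulting any
// locale. Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            const bool has_low = i + 1 < in.size() &&
                                 in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
            if (has_low) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // the low surrogate has been consumed
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string can be passed to a char based format string, with the caveat that it is only displayed correctly where char strings are interpreted as UTF-8.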

Thanks in advance, and if I can help in any way I will, as it's really bad for C++ not to support this. The main power of C++ is its ability to work on lots of platforms, and the main obstacle to making portable code much safer and more compact is C++ text support, which unfortunately should be much better than it is now...

tahonermann commented 11 months ago

Thank you for your comments, @zoran12. I also agree that this is important.

The simple answer for why there has been no progress adding support for the charN_t types is that no one has done the work yet to bring forward a proposal. If you would like to help, here are a few things you could do:

- Post patches to https://github.com/fmtlib/fmt to add support. This would help to identify dependencies that are likewise missing charN_t support; like charN_t specializations of std::locale facets as I pointed out at https://stackoverflow.com/a/77255029/11634221.
- Draft a proposal following the process at https://isocpp.org/std/submit-a-proposal.
- Communicate progress and solicit feedback on the above via the SG16 mailing list.

zoran12 commented 11 months ago

Thanks for the response, mate. I would like to help. I don't think I will be able to write a proposal, as these papers really need to be technical, but I hope I can do work in code that provides a solution or at least helps find one...

As for FMT, it already supports all character types: with fmt::format you can now format char8_t, char16_t and char32_t text. I used it first and tried to make it a header-only version, as I am converting my engine code to be headers only (so you just include one header and #define ENGINE_CPP in one cpp file, and all should work perfectly). I didn't notice anything missing that I could help with there; it all seems to work. I even still have it included in the project as header only, and it works fine with all char types...

With the experience from converting FMT to header only, and when I saw that the same person created FMT and the std::format code, I tried to see if it's possible to enable std::format to work with all character types. I had almost done it, just by adding the missing template versions for the new char types, when I hit a wall with the error >> C2491: 'std::numpunct<_Elem>::id' : definition of dllimport static data member not allowed <<...

Realizing that I can't just add code like this, I created a new file, which I am attaching here: stdFormat.zip

This file is a full copy/paste of the std format.h file, so that I could make changes directly. I also copied the code for a new numpunct class so that I could avoid that error (the original class is in the xlocnum.h file). The new class is a copy/paste of the old class with only a few changes. For example, the original class has this line, which was causing the issue: static_assert(!_ENFORCE_FACET_SPECIALIZATIONS || _Is_any_of_v<_Elem, char, wchar_t>, _FACET_SPECIALIZATION_MESSAGE); so the new class has this instead: static_assert(!_ENFORCE_FACET_SPECIALIZATIONS || std::_Is_any_of_v<_Elem, char, wchar_t, char8_t, char16_t, char32_t>, _FACET_SPECIALIZATION_MESSAGE); I am not sure whether these were the changes that needed to be made for the missing support, but they were causing that error, which couldn't be solved just by adding the missing template specializations for char8_t, char16_t and char32_t...

All other changes were simple ones: just adding specialization classes and functions for char8_t, char16_t and char32_t that are copy/pastes of ones already found in the code, with the types changed, e.g. from char to char8_t...

I honestly didn't expect this to work, but it worked and seems to work perfectly :

[screenshots: 001, 002]

I uploaded the stdFormat.hpp file (in a zip, as I couldn't upload the hpp file type). Just including it and using the fmt::stdvFormat or fmt::stdcFormat function lets you format any character type. All the other functions are created the same way and now work for all character types (char, wchar_t, char8_t, char16_t, char32_t). The only limit is that you can use only one character type for the format text and all field texts, as no formatters that would do conversion were added, since I wasn't sure how they should work (I think ASCII char to all other types is created by default, as soon as I changed the supported character types to include all types...). The only thing I didn't change was the chrono part:

// _STATICALLY_WIDEN is used by <chrono> since C++20 and by <format> since C++23.
// It's defined here, so that both headers can use this definition.

#define _STATICALLY_WIDEN(_CharT, _Literal) (_Choose_literal<_CharT>(_Literal, L##_Literal))

which uses this function:

template <class _CharT> _NODISCARD constexpr const _CharT* _Choose_literal(const char* const _Str, const wchar_t* const _WStr) noexcept { if constexpr (std::is_same_v<_CharT, char>) { return _Str; } else { return _WStr; } }

It could require changes in the chrono files to use the new version, so this change would need to go into the original format.h file... This change would also be simple: add support for the other 3 character types the same way it's added for wide char.

So the FMT library already supports formatting all character types, and even the std version in format.h has like 90% of its support done, so it's kind of a shame not to finish it and finally have at least std::format working for all character types :)

tahonermann commented 11 months ago

@zoran12, I'll reply to a few points below, but this github issue is intended more for administrative purposes than for discussion. For further discussion, please post to the SG16 mailing list.

As for FMT, it already supports all character types: with fmt::format you can now format char8_t, char16_t and char32_t text.

I understand that you were able to modify fmtlib to work for your purposes, but support for these types is not present in fmtlib at present as exhibited by https://godbolt.org/z/TbKdGb1hW. One way that you could help is to submit code changes and tests to fmtlib to add such support for everyone. That would be a great first step towards standardization.

Please note that there isn't just one std::format() implementation. Microsoft provides one, another one is provided by libstdc++ with gcc, and yet another one is provided by libc++ with clang/LLVM.

The prototyping you've done demonstrates that some locale enhancements are needed in the standard library to make this work. Per [locale.numpunct.general]p1, implementations are only required to provide the numpunct<char> and numpunct<wchar_t> specializations. This is also true for other locale facets. See the [locale.category.facets] table.

The standard will need to be updated to add (at least) the missing locale facet specializations before support for other character types can be specified for std::format() in the standard.
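
A small sketch of the facet dependency, assuming a helper named `decimal_point_of` (hypothetical). It works for char and wchar_t because those numpunct specializations are required; instantiating it for char8_t, char16_t, or char32_t is not guaranteed to compile, link, or find a facet at run time.

```cpp
#include <locale>

// Hypothetical helper: looks up the decimal point character that number
// formatting would use for CharT in the given locale.
template <class CharT>
CharT decimal_point_of(const std::locale& loc) {
    // std::use_facet throws std::bad_cast if the locale has no such facet.
    return std::use_facet<std::numpunct<CharT>>(loc).decimal_point();
}
```

For example, `decimal_point_of<char>(std::locale::classic())` yields '.', and the wchar_t instantiation yields L'.'.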

zoran12 commented 11 months ago

Oh, sorry mate for posting here; this is going to be my last message, as I am not sure how to post to the SG16 mailing list. I am receiving SG16 mails as I'm subscribed, so do I only need to send mail to sg16@lists.isocpp.org? And make sure I put a proper subject text for better mail handling? Hmm, mail communication like this isn't the best and makes people miss things, like you probably missed that fmt added support for all char types (that support was added in a new header file called xchar.h). So to make your example work on Compiler Explorer, simply including the header >>#include <fmt/xchar.h><< makes it happen. I can't generate a link, so I will post a screenshot:

[screenshot: 004]

It would be best, once you confirm that fmt works, to maybe start a new subject on the SG16 mailing list about implementing its changes in std::format, or at least to get info about its state from the person who wrote it (Victor Zverovich), as I think he is also on this list. I will do my best to join the SG16 mailing conversation if I see I can help in any way :) I only hope that at least making you aware of fmt's state was worth the mess I made here (if you can remove/delete all my posts, please do, so the mess goes away :) )

tahonermann commented 11 months ago

@zoran12,

I am receiving SG16 mails as I'm subscribed, so do I only need to send mail to sg16@lists.isocpp.org? And make sure I put a proper subject text for better mail handling?

Exactly. I understand that mailing lists are unfamiliar to some people, but there is no substitute that can scale to the volume of information some of us process daily.

Thank you for the information about fmt/xchar.h; I was not aware of that addition. The example I previously linked does indeed build successfully with the addition of that header file. https://godbolt.org/z/b9boK718E. I'm curious how it handles the locale dependencies.

I'm not sure why you are unable to generate compiler explorer links. Just click the 'Share' drop down menu and then click 'Short link' and copy the URL.

I know Victor well; he has been a contributor to SG16 for many years now :)