sg16-unicode / sg16

SG16 overview and general information
45 stars 5 forks source link

std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits (LWG#2959) #32

Open tahonermann opened 6 years ago

tahonermann commented 6 years ago

The specialization for std::char_traits<char16_t> requires a member int_type defined as uint_least16_t and for the member function eof() to return "an implementation-defined constant that cannot appear as a valid UTF-16 code unit."

However, all 16 bit values are valid UTF-16 code unit values, so the only way for an implementation to be conforming would be if its uint_least16_t type is larger than 16 bits.

tahonermann commented 6 years ago

This was discovered during discussion on Twitter with Billy O'Neal over Microsoft's use of unsigned short for wint_t and definition of WEOF as 0xFFFF. The specification of std::char_traits<char16_t> creates the same issue he faced with Microsoft's wchar_t implementation choices.

tahonermann commented 6 years ago

According to the Unicode FAQ, non-characters (and U+FFFF in particular) are valid in UTF-16 strings and cannot be used as sentinels. Relevant FAQ entries:

Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.

cubbimew commented 6 years ago

LWG thinks it's NAD: https://cplusplus.github.io/LWG/issue1200 GCC (@jwakely specifically) thinks it's a defect: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80624

rmartinho commented 6 years ago

int_least16_t is likely to be the same size as char16_t, which may lead to surprising behavior, even if eof() is not a valid UTF-16 code unit.

To me this sounds like the decision was made without a proper understanding of the issue: all 16-bit values are valid UTF-16 code units.

The proposed wording betrays both a lack of understanding of what "permanently reserved" means (it doesn't mean it cannot be used, nor that it can be lost in interchange; only that it won't be assigned), and a misguided focus on UCS-2.

I think we should bring it up again with more expertise on hand.

tahonermann commented 6 years ago

Thanks for those links, Sergey. The wiki notes for the Rapperswil review of LWG1200 are available at http://wiki.edg.com/bin/view/Wg21rapperswil/LWGSubGroup1. Unfortunately, those notes don't provide any more useful information regarding their review.

I agree with LWG's position that the proposed resolutions were overspecified. But Martinho is right, it looks like LWG didn't understand the concern. They appear to admit that with their "it is not clear what problem is being solved" comment.

I think our concern is slightly different than lwg1200. It appears that lwg1200 is complaining about the situation in which the eof() value can be held in a char_type value regardless of whether that value constitutes a valid code unit for the encoding. Our concern is that there is no value that eof() can return that is not a valid code unit (unless uint_least16_t is larger than 16 bits). I think we should proceed with a new issue that explains this view point but refers to lwg1200 as a related concern.

jwakely commented 6 years ago

On Wed, 25 Jul 2018 at 16:39, Tom Honermann wrote:

I think our concern is slightly different than lwg1200. It appears that lwg1200 is complaining about the situation in which the eof() value can be held in a char_type value regardless of whether that value constitutes a valid code unit for the encoding. Our concern is that there is no value that eof() can return that is not a valid code unit (unless uint_least16_t is larger than 16 bits).

Right, that's the problem I solved in GCC.

I think we should proceed with a new issue that explains this view point but refers to lwg1200 as a related concern.

See https://cplusplus.github.io/LWG/lwg-active.html#2959

tahonermann commented 6 years ago

Thanks, Jonathan.

Right, that's the problem I solved in GCC.

... by changing to_int_type() to return 0xFFFD when converting from a value that matches eof():

Just to be clear, this is an improvement over spurious eof() matches, but doesn't actually solve lwg2959.

See https://cplusplus.github.io/LWG/lwg-active.html#2959

lwg2959 looks like an exact match for this issue. Thanks!

tahonermann commented 6 years ago

Just to be clear, this is an improvement over spurious eof() matches, but doesn't actually solve lwg2959.

... doesn't actually solve lwg2959 because the requirements for eq_int_type() in [char.traits.require] aren't satisfied. For c=0xFFFF and d=0xFFFD, eq(c,d) is false, but eq_int_type(to_int_type(c), to_int_type(d)) is true. Nor are the requirements for eof() satisfied since, for c=0xFFFD, eq_int_type(eof(), to_int_type(c)) is true.

jwakely commented 6 years ago

But the wording already allows char_traits<char16_t> to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from to_int_type(0xFFFF)). No change is needed to the standard to make that true (just fixes to implementations).

The defect in the standard is the implication that there is any 16-bit value that "cannot appear as a valid UTF-16 code unit".

tahonermann commented 6 years ago

But the wording already allows char_traits to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from to_int_type(0xFFFF)).

If I understand correctly, you're suggesting that, for c=0xFFFF and d=0xFFFD, that eq(c,d) can return true. But that would contradict [char.traits.specializations.char16_t]p2.

jwakely commented 6 years ago

Ah yes. I don't see how that can be solved without an ABI break to make int_type larger than char_type.

dimztimz commented 5 years ago

I'd like to give my comment on this.

Changing the type to 32 bit unsigned integer is the only real solution. Indeed, it changes the ABI, but I can give an argument that the impact of this change is extremely low to zero.

  1. char_traits::int_type and char_traits::eof() are used only in the context of iostreams.
  2. That means char_traits<char16_t>::int_type and eof() are meant to be used only in the context basic_istream<char16_t>, basic_ostream<char16_t>, and their derivatives.
  3. Iostreams with type char16_t are not really a thing, are largely unusable, thus it is highly unlikely that anyone ever uses them in real-world production code.
  4. Since nobody uses them, any changes there are highly unlikely to have an impact.

I will now explain why the iostreams that work with char16_t are unusable. There are very few operation that one can do with these iostreams.

  1. You CAN construct iostream<char16_t>
  2. You CAN NOT call any of the formatted io operations because all depend on locale facets, and facets with char16_t do not exists. You will get compile-time error.
  3. You CAN NOT call some of the unformatted io operations, those that work with the newline character. Those depend on ctype<char16_t> facet. They need this facet to call ctype_facet.widen('\n')
  4. You CAN only call the unformatted io functions that do not depend on any locale facet. That leaves very few functions of the whole iostream.
jwakely commented 5 years ago

In theory implementations can provide ctype and other facets. If the implementation doesn't, users can define their own facets and imbue a custom locale into streams.

dimztimz commented 5 years ago

In theory, yes. In theory one could use char_traits::int_type all over the place. I'm saying that practically nobody uses that.

AFAIK there are only 4 implementations that provide modern C++ implementations, GCC, Clang, MSVC and EDG. Other implementations like Intel's are based on EDG. For the first 3 I can safely say then they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same.

Then there is Boost::Locale which claims that has experimental support for char16_t and char32_t facets that has to be enabled at configure-time with a preprocessor define, and is disabled by default. But, even when you explicitly set that definition, it won't work. Later in client code, when you use a char16_t facet, you will get a link-time error. Boost locale expects that std::facets<char16_t> exists and derives on those. Basically the boost::locale support for char16_t is not experimental but is broken, and that is the reason it is disabled by default.

I guess a Debian code search should be done to see if there is a place that uses iostreams with char16_t and char_traits::int_type.

And, after all, GCC once did ABI break on std::string, and that went well with the tricks employed there.

jwakely commented 5 years ago

You're not really adding useful information here.

In theory one could use char_traits::int_type all over the place.

That wouldn't work, because you don't control the code inside the standard library.

For the first 3 I can safely say then they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same.

EDG is a compiler front-end, it has nothing to do with providing facets, locales, iostreams etc. and similarly Clang is a compiler. What matters is the standard library implementation (and there are only three relevant implementations of that).

And, after all, GCC once did ABI break on std::string, and that went well with the tricks employed there.

Firstly, it was not a "break" because the old ABI is still present and supported, and how well it went is debatable. Secondly, that isn't an option in this case. Source: me.

dimztimz commented 5 years ago

Ok then. Do we all agree that char_traits<char16_t>::int_type needs to be changed to to wider type?

tahonermann commented 5 years ago

I think we had agreement on that before your recent comments :) See Jonathan's comment on July 26th.

dimztimz commented 5 years ago

How should we proceed? Is a classic paper with number Pxxxx needed, or there is different procedure for defect reports?

strega-nil commented 5 years ago

@jwakely I believe there are four - Sony uses dinkumware's STL.

tahonermann commented 5 years ago

@dimztimz There is already an active LWG issue for this so, to some degree, this is effectively in LWG's court. A paper arguing for a particular resolution could be helpful in moving things along, but attending an LWG issues processing session could also have the same effect.

@ubsan, Microsoft uses Dinkumware's STL (heavily customized).

strega-nil commented 5 years ago

@tahonermann not anymore - it's a fork of dinkumware's STL that hasn't been in alignment with mainline for ~three years.

cor3ntin commented 5 years ago

same issue with char8_t - it should return a 32 bits int whose value is not in the range 00-10FFFF This can be fixed for 20

jwakely commented 5 years ago

@cor3ntin char_traits<char8_t>::eof() only needs to return a value which is not a valid UTF-8 code unit. So any integer larger than char8_t will work. 10FFFF cannot appear as a UTF-8 code unit.

cor3ntin commented 5 years ago

There is no requirement that u8string stores valid utf-8 data. So now if a iostream function returns eof(), do I have invalid data or actually reach the end of the file? No way for me to know, forcing me to check the state of the string. Any 8 bit value is a utf8 code unit and can appear in input data. Same thing for utf 16 and 32 actually. But 32 is a bit complicated, I suppose we couldn't require a 64 bits value? I am missing something?

jwakely commented 5 years ago

But a u8string stores 8-bit code units. An 8-bit code unit cannot have the value 10FFFF.

eof() is compared to elements of the string, i.e. code units, not code points.

cor3ntin commented 5 years ago

Hum, nevermind, I originally missunderstood your reply, everything is fine 🥺

dimztimz commented 5 years ago

Was there any progress on this issue in the last ISO meeting?

Can NB comments be given for issues?

tahonermann commented 5 years ago

No, we didn’t spend any time on this in cologne.

You can certainly submit an NB comment for it. I think the only potentially controversial aspect is whether fixing this is worth an ABI break. I don’t have a good sense of what the committee will feel about that.

Ixrec commented 4 years ago

On ABI breaks, did P1863 get discussed at Belfast? I suspect "minor" reasons for breakage like this issue stand or fall with that larger question and ought to get included in a large batch of changes if done at all.

tahonermann commented 4 years ago

On ABI breaks, did P1863 get discussed at Belfast?

No, it didn't. I think there is intent for it to get discussed in Prague.

tahonermann commented 9 months ago

For historical reference, Unicode Corrigendum #9 (Clarification About Noncharacters) clarified that noncharacter code points are not prohibited in interchange.

dimztimz commented 9 months ago

A temporary solution can be to deprecate the type alias int_type and the member function eof() for the char16_t specialization.

tahonermann commented 9 months ago

Thanks, @dimztimz. Deprecating int_type would require deprecating not_eof(), to_char_type(), to_int_type(), and eq_int_type() in addition to eof(). We could do that, but it doesn't get us much closer to a solution. I think we're heading in the direction of just fixing int_type as a DR. I'm going to reach out to implementors to see if they have any additional concerns and, based on their response, will poll that direction in SG16.

tahonermann commented 5 months ago

Per the last comment, I reached out to implementors to collect their thoughts on whether it is feasible to just fix int_type as a DR. I received a couple of responses that I'll summarize in the github issue tracking LWG2959.