Open tahonermann opened 6 years ago
This was discovered during discussion on Twitter with Billy O'Neal over Microsoft's use of `unsigned short` for `wint_t` and definition of `WEOF` as `0xFFFF`. The specification of `std::char_traits<char16_t>` creates the same issue he faced with Microsoft's `wchar_t` implementation choices.
According to the Unicode FAQ, non-characters (and U+FFFF
in particular) are valid in UTF-16 strings and cannot be used as sentinels. Relevant FAQ entries:
Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.
LWG thinks it's NAD: https://cplusplus.github.io/LWG/issue1200
GCC (@jwakely specifically) thinks it's a defect: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80624
`int_least16_t` is likely to be the same size as `char16_t`, which may lead to surprising behavior, even if `eof()` is not a valid UTF-16 code unit.
To me this sounds like the decision was made without a proper understanding of the issue: all 16-bit values are valid UTF-16 code units.
The proposed wording betrays both a lack of understanding of what "permanently reserved" means (it doesn't mean it cannot be used, nor that it can be lost in interchange; only that it won't be assigned), and a misguided focus on UCS-2.
I think we should bring it up again with more expertise on hand.
Thanks for those links, Sergey. The wiki notes for the Rapperswil review of LWG1200 are available at http://wiki.edg.com/bin/view/Wg21rapperswil/LWGSubGroup1. Unfortunately, those notes don't provide any more useful information regarding their review.
I agree with LWG's position that the proposed resolutions were overspecified. But Martinho is right, it looks like LWG didn't understand the concern. They appear to admit that with their "it is not clear what problem is being solved" comment.
I think our concern is slightly different than lwg1200. It appears that lwg1200 is complaining about the situation in which the `eof()` value can be held in a `char_type` value regardless of whether that value constitutes a valid code unit for the encoding. Our concern is that there is no value that `eof()` can return that is not a valid code unit (unless `uint_least16_t` is larger than 16 bits). I think we should proceed with a new issue that explains this viewpoint but refers to lwg1200 as a related concern.
On Wed, 25 Jul 2018 at 16:39, Tom Honermann wrote:
I think our concern is slightly different than lwg1200. It appears that lwg1200 is complaining about the situation in which the eof() value can be held in a char_type value regardless of whether that value constitutes a valid code unit for the encoding. Our concern is that there is no value that eof() can return that is not a valid code unit (unless uint_least16_t is larger than 16 bits).
Right, that's the problem I solved in GCC.
I think we should proceed with a new issue that explains this view point but refers to lwg1200 as a related concern.
Thanks, Jonathan.
Right, that's the problem I solved in GCC.
... by changing `to_int_type()` to return 0xFFFD when converting from a value that matches `eof()`.
Just to be clear, this is an improvement over spurious `eof()` matches, but doesn't actually solve lwg2959.
lwg2959 looks like an exact match for this issue. Thanks!
Just to be clear, this is an improvement over spurious eof() matches, but doesn't actually solve lwg2959.
... doesn't actually solve lwg2959 because the requirements for `eq_int_type()` in [char.traits.require] aren't satisfied. For `c=0xFFFF` and `d=0xFFFD`, `eq(c,d)` is false, but `eq_int_type(to_int_type(c), to_int_type(d))` is true. Nor are the requirements for `eof()` satisfied since, for `c=0xFFFD`, `eq_int_type(eof(), to_int_type(c))` is true.
But the wording already allows `char_traits<char16_t>` to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from `to_int_type(0xFFFF)`). No change is needed to the standard to make that true (just fixes to implementations).
The defect in the standard is the implication that there is any 16-bit value that "cannot appear as a valid UTF-16 code unit".
But the wording already allows `char_traits` to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from `to_int_type(0xFFFF)`).

If I understand correctly, you're suggesting that, for `c=0xFFFF` and `d=0xFFFD`, `eq(c,d)` can return true. But that would contradict [char.traits.specializations.char16_t]p2.
Ah yes. I don't see how that can be solved without an ABI break to make `int_type` larger than `char_type`.
I'd like to give my comment on this.
Changing the type to a 32-bit unsigned integer is the only real solution. Indeed, it changes the ABI, but I can give an argument that the impact of this change is extremely low to zero.
1. `char_traits::int_type` and `char_traits::eof()` are used only in the context of iostreams.
2. `char_traits<char16_t>::int_type` and `eof()` are meant to be used only in the context of `basic_istream<char16_t>`, `basic_ostream<char16_t>`, and their derivatives.
3. Iostreams of `char16_t` are not really a thing and are largely unusable, thus it is highly unlikely that anyone ever uses them in real-world production code.

I will now explain why the iostreams that work with `char16_t` are unusable.
There are very few operations that one can do with these iostreams:

1. Formatted input and output of `char16_t` via `iostream<char16_t>` do not exist. You will get a compile-time error.
2. The remaining operations require a `ctype<char16_t>` facet, which implementations do not provide. They need this facet to call `ctype_facet.widen('\n')`.
In theory implementations can provide ctype
In theory, yes. In theory one could use `char_traits::int_type` all over the place. I'm saying that practically nobody uses that.
AFAIK there are only 4 implementations that provide a modern C++ standard library: GCC, Clang, MSVC, and EDG. Other implementations, like Intel's, are based on EDG. For the first 3 I can safely say that they don't provide facets that work with `char16_t` and `char32_t`. I'd expect EDG is the same.
Then there is Boost.Locale, which claims to have experimental support for `char16_t` and `char32_t` facets that has to be enabled at configure time with a preprocessor define, and is disabled by default. But even when you explicitly set that define, it won't work: later, in client code, when you use a `char16_t` facet, you will get a link-time error. Boost.Locale expects the `std::` facets for `char16_t` to exist and derives from those. Basically, the Boost.Locale support for `char16_t` is not experimental but broken, and that is the reason it is disabled by default.
I guess a Debian code search should be done to see if there is a place that uses iostreams with `char16_t` and `char_traits::int_type`.
And, after all, GCC once did an ABI break on `std::string`, and that went well with the tricks employed there.
You're not really adding useful information here.
In theory one could use `char_traits::int_type` all over the place.
That wouldn't work, because you don't control the code inside the standard library.
For the first 3 I can safely say then they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same.
EDG is a compiler front end; it has nothing to do with providing facets, locales, iostreams, etc. Similarly, Clang is a compiler. What matters is the standard library implementation (and there are only three relevant implementations of that).
And, after all, GCC once did ABI break on std::string, and that went well with the tricks employed there.
Firstly, it was not a "break" because the old ABI is still present and supported, and how well it went is debatable. Secondly, that isn't an option in this case. Source: me.
Ok then. Do we all agree that `char_traits<char16_t>::int_type` needs to be changed to a wider type?
I think we had agreement on that before your recent comments :) See Jonathan's comment on July 26th.
How should we proceed? Is a classic paper with a Pxxxx number needed, or is there a different procedure for defect reports?
@jwakely I believe there are four - Sony uses Dinkumware's STL.
@dimztimz There is already an active LWG issue for this so, to some degree, this is effectively in LWG's court. A paper arguing for a particular resolution could be helpful in moving things along, but attending an LWG issues processing session could also have the same effect.
@ubsan, Microsoft uses Dinkumware's STL (heavily customized).
@tahonermann not anymore - it's a fork of Dinkumware's STL that hasn't been in alignment with mainline for ~three years.
Same issue with `char8_t`: it should return a 32-bit int whose value is not in the range 00-10FFFF. This can be fixed for C++20.
@cor3ntin `char_traits<char8_t>::eof()` only needs to return a value which is not a valid UTF-8 code unit. So any integer larger than any `char8_t` value will work. 10FFFF cannot appear as a UTF-8 code unit.
There is no requirement that `u8string` stores valid UTF-8 data. So now if an iostream function returns `eof()`, do I have invalid data or did I actually reach the end of the file? No way for me to know, forcing me to check the state of the stream. Any 8-bit value is a UTF-8 code unit and can appear in input data. Same thing for UTF-16 and UTF-32 actually. But 32 is a bit complicated; I suppose we couldn't require a 64-bit value? Am I missing something?
But a `u8string` stores 8-bit code units. An 8-bit code unit cannot have the value 10FFFF. `eof()` is compared to elements of the string, i.e. code units, not code points.
Hmm, never mind, I originally misunderstood your reply; everything is fine 🥺
Was there any progress on this issue in the last ISO meeting?
Can NB comments be given for issues?
No, we didn't spend any time on this in Cologne.
You can certainly submit an NB comment for it. I think the only potentially controversial aspect is whether fixing this is worth an ABI break. I don’t have a good sense of what the committee will feel about that.
On ABI breaks, did P1863 get discussed at Belfast? I suspect "minor" reasons for breakage like this issue stand or fall with that larger question and ought to get included in a large batch of changes if done at all.
On ABI breaks, did P1863 get discussed at Belfast?
No, it didn't. I think there is intent for it to get discussed in Prague.
For historical reference, Unicode Corrigendum #9 (Clarification About Noncharacters) clarified that noncharacter code points are not prohibited in interchange.
A temporary solution can be to deprecate the type alias `int_type` and the member function `eof()` for the `char16_t` specialization.
Thanks, @dimztimz. Deprecating `int_type` would require deprecating `not_eof()`, `to_char_type()`, `to_int_type()`, and `eq_int_type()` in addition to `eof()`. We could do that, but it doesn't get us much closer to a solution. I think we're heading in the direction of just fixing `int_type` as a DR. I'm going to reach out to implementors to see if they have any additional concerns and, based on their response, will poll that direction in SG16.
Per the last comment, I reached out to implementors to collect their thoughts on whether it is feasible to just fix `int_type` as a DR. I received a couple of responses that I'll summarize in the GitHub issue tracking LWG2959.
The specialization for `std::char_traits<char16_t>` requires a member `int_type` defined as `uint_least16_t` and the member function `eof()` to return "an implementation-defined constant that cannot appear as a valid UTF-16 code unit."

However, all 16-bit values are valid UTF-16 code unit values, so the only way for an implementation to be conforming would be if its `uint_least16_t` type is larger than 16 bits.