sg16-unicode / sg16

SG16 overview and general information
45 stars 5 forks source link

A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33

Open BillyONeal opened 6 years ago

BillyONeal commented 6 years ago

UTF-8 <-> UTF-16 is prohibited by http://eel.is/c++draft/locale.codecvt#virtuals-3, since 4 UTF-8 encoding units can produce 2 UTF-16 encoding units.

Unfortunately it is not so easy to fix as we would need the counterpart to do_unshift on the "wide" side of the interface which would be massively ABI breaking.

cubbimew commented 6 years ago

I've seen opensource use of std::codecvt_utf8_utf16 with file streams -- maybe the authors are unaware that their code is non-portable, since it works with libstdc++ (fails with libc++ and MSVC). Here's a demo: http://coliru.stacked-crooked.com/a/dd397ba886d9bbff

Can't the standard acknowledge that an implementation is allowed to do the right thing with "otherwise, the behavior is implementation-defined"? Or, perhaps introduce std::u16stream and std::u16filebuf that aren't tied to this restriction?

tahonermann commented 6 years ago

I've been aware of this limitation for some time, but I've never researched why the limitation exists. Is there something fundamental about std::basic_filebufthat doesn't work for MxN transcoders? I assume not since libstdc++ apparently works as expected with such transcoders. Perhaps the limitation exists in the standard simply because the desired behavior doesn't match (some) existing implementations?

Regardless, my answer for this is to deprecate std::codecvt in favor of some yet-to-be-determined replacement. My inclination is towards something like text_view, though that doesn't fully suffice as a replacement since text_view doesn't support writeable views (it supports read-only views and write-only iterators), nor can a stream be imbued with such a view (I'm so far unconvinced that a replacement for imbue() functionality is needed or desirable).

BillyONeal commented 6 years ago

When people say "libstdc++ apparently works" have people actually verified it does the right thing when libstdc++ needs to cross a buffer boundary, or needs to seek in the file, etc.? I think these edge cases are solvable but unless they go out of their way to document that they support it I would be cautious. Triggering a surrogate pair split right on a buffer boundary, for example, needs to be handled correctly for correct behavior but is, at the very least, tricky.

text_view looks interesting, but it looks like it operates on a codepoint-by-codepoint basis rather than through bulk submission, so I'd expect performance to be poor. High quality UTF encoders use techniques to process more than one encoding unit at a time (e.g. most UTF-8 texts have long strings of 7 bit ASCII in them, and a vectorized version of that to UTF-16 is easy-ish to write). The paper claims codecvt is expensive due to the virtual function call, but so long as the virtual call is done on whole buffers rather than on single codepoints that virtual call should be noise; as UTF decoding itself is a fairly expensive operation.

Honestly, I think the ideal interface looks very much like codecvt with a better way of reporting errors and removal of the 1:N restriction. Unfortunately lifting the 1:N restriction is an ABI break for MSVC++ because we store only a single wchar_t inside the basic_filebuf when a codecvt facet is engaged :(

tahonermann commented 6 years ago

When people say "libstdc++ apparently works" have people actually verified it does the right thing when libstdc++ needs to cross a buffer boundary, or needs to seek in the file, etc.? I think these edge cases are solvable but unless they go out of their way to document that they support it I would be cautious. Triggering a surrogate pair split right on a buffer boundary, for example, needs to be handled correctly for correct behavior but is, at the very least, tricky.

Good question, @jwakely, any comments on this?

text_view looks interesting, but it looks like it operates on a codepoint-by-codepoint basis rather than through bulk submission, so I'd expect performance to be poor.

This is definitely a concern. I've long intended to provide transcoding interfaces that would enable optimizations like you described when invoked with a contiguous range (or segmented contiguous ranges). Alas, I've made no progress on that front.

BillyONeal commented 6 years ago

Too bad it's easier for people like me to just complain about the problem rather than actually solve it, right? :P Here are some benchmarks if it helps https://gist.github.com/BillyONeal/72dcde394758d4f9d82324774b8107e4

tahonermann commented 6 years ago

Too bad it's easier for people like me to just complain about the problem rather than actually solve it, right? :P

I think we all play that role more often than not ;)

Here are some benchmarks if it helps https://gist.github.com/BillyONeal/72dcde394758d4f9d82324774b8107e4

Ohhh, neat, thanks! I'm assuming you're familiar with Bob Steagall's recent work?

jwakely commented 6 years ago

If libstdc++ gets it right it's not my doing, and I can't really comment on that code. It's indistinguishable from magic as far as I'm concerned.

BillyONeal commented 6 years ago

Ohhh, neat, thanks! I'm assuming you're familiar with Bob Steagall's recent work?

I've seen the talk but not attempted implementing his algorithm yet. I wrote these a long time ago when I was trying to replace the guts of our codecvt facets for ABI-broken-world, but ABI-broken-world seems to only get further way so I had to stop that work for some time.