llvmbot opened 10 years ago
I can reproduce this; it will almost certainly take a change in the dylib to fix.
```cpp
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    {
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
                                                std::consume_header>,
                             char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        const char src[] = "\xFF\xFE\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
                                                std::consume_header>,
                             char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
                                                std::codecvt_mode(std::consume_header | std::little_endian)>,
                             char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        const char src[] = "\xFF\xFE\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
                                                std::codecvt_mode(std::consume_header | std::little_endian)>,
                             char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
}
```
This prints:
```
Failed to process BOM from UTF-16LE string when expecting BE
Read fffe
Failed to process BOM from UTF-16BE string when expecting LE
Read fffe
```
i.e., it fails "to determine the endianness of the subsequent multibyte sequence to be read"; instead, it unconditionally assumes the endianness given by the codecvt_mode template argument.
It seems that libc++ treats consume_header as "optionally discard a BOM that matches my current endianness" instead of "optionally read a BOM to determine the endianness"; i.e., you have to create the facet with the correct endianness in the first place to be able to "consume" the BOM.
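Until the facet handles this itself, one workaround is to inspect the BOM before constructing the converter. A minimal sketch, assuming BMP-only input; `from_utf16_bytes` is a hypothetical helper, not part of any proposed fix:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical workaround: sniff the first two bytes ourselves and
// dispatch to a facet whose compile-time endianness matches the BOM.
std::u16string from_utf16_bytes(const char* first, const char* last) {
    if (last - first >= 2 &&
        (unsigned char)first[0] == 0xFF && (unsigned char)first[1] == 0xFE) {
        // Little-endian BOM: use a little_endian facet and skip the BOM.
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
                                                std::little_endian>,
                             char16_t> conv;
        return conv.from_bytes(first + 2, last);
    }
    // Big-endian BOM, or no BOM at all: the default facet is big-endian.
    if (last - first >= 2 &&
        (unsigned char)first[0] == 0xFE && (unsigned char)first[1] == 0xFF)
        first += 2;
    std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF>,
                         char16_t> conv;
    return conv.from_bytes(first, last);
}
```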
The standard says:
— If (Mode & consume_header), the facet shall consume an initial header sequence, if present, when reading a multibyte sequence to determine the endianness of the subsequent multibyte sequence to be read.
i.e., the facet must be able to switch endianness at runtime based on the BOM, not have it fixed at compile time.
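Concretely, the detected byte order has to become runtime state in the decoder; the Mode template argument only supplies the default when no BOM is present. A rough sketch of conforming behavior (illustrative types and names, not libc++'s internals):

```cpp
#include <cstdint>

// Illustrative sketch only; does not correspond to libc++ internals.
struct utf16_decoder {
    bool little_endian;  // runtime endianness, seeded from the Mode argument

    // Called once at the start of the byte sequence when consume_header is set.
    const unsigned char* consume_header(const unsigned char* p,
                                        const unsigned char* end) {
        if (end - p >= 2) {
            if (p[0] == 0xFE && p[1] == 0xFF) { little_endian = false; return p + 2; }
            if (p[0] == 0xFF && p[1] == 0xFE) { little_endian = true;  return p + 2; }
        }
        return p;  // no BOM: keep the compile-time default
    }

    // Every subsequent code unit is read with the *detected* byte order.
    std::uint16_t read_unit(const unsigned char* p) const {
        return little_endian ? std::uint16_t(p[0] | (p[1] << 8))
                             : std::uint16_t((p[0] << 8) | p[1]);
    }
};
```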
Extended Description
Hi, it seems the facet codecvt_utf16 is buggy.
I have a file which is UTF-16LE encoded with a BOM. I define two facets like this:

```cpp
std::locale utf16_locale( std::locale(),
    new std::codecvt_utf16< wchar_t, 0x10ffff, std::consume_header > );
std::locale utf16_localeLEBOM( std::locale(),
    new std::codecvt_utf16< wchar_t, 0x10ffff,
        static_cast< std::codecvt_mode >( 5 ) /* std::little_endian | std::consume_header */ > );
```
I then define two streams like this:

```cpp
std::wifstream xmlfile1;
std::wifstream xmlfile2;
```
And read them like this:

```cpp
{
    std::istreambuf_iterator< wchar_t > eos;
    xmlfile1.imbue( utf16_locale );
    xmlfile1.open( szFilename, std::ios::in | std::ios::binary );
    std::wstring strXML( std::istreambuf_iterator< wchar_t >( xmlfile1 ), eos );
}
{
    std::istreambuf_iterator< wchar_t > eos;
    xmlfile2.imbue( utf16_localeLEBOM );
    xmlfile2.open( szFilename, std::ios::in | std::ios::binary );
    std::wstring strXML( std::istreambuf_iterator< wchar_t >( xmlfile2 ), eos );
}
```
With the first one, the BOM isn't recognized or analyzed, and I end up with strXML containing the "raw" content of the file, including the BOM! With the second one, I end up with the correct string, matching what I see in a text editor.
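A workaround sketch for this stream case, assuming the goal is just to get strXML decoded correctly: sniff the BOM with a plain byte stream first, then imbue a facet whose endianness matches (`read_utf16_file` is an illustrative helper name):

```cpp
#include <codecvt>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical workaround: probe the first two bytes, then build the
// locale with the facet whose compile-time endianness matches the BOM.
std::wstring read_utf16_file(const char* szFilename) {
    bool le = false;
    {
        std::ifstream probe(szFilename, std::ios::in | std::ios::binary);
        char bom[2] = {};
        probe.read(bom, 2);
        le = probe.gcount() == 2 &&
             (unsigned char)bom[0] == 0xFF && (unsigned char)bom[1] == 0xFE;
    }
    // Both instantiations derive from codecvt<wchar_t, char, mbstate_t>,
    // so the choice can be made at runtime through the base pointer.
    std::codecvt<wchar_t, char, std::mbstate_t>* cvt;
    if (le)
        cvt = new std::codecvt_utf16<wchar_t, 0x10ffff,
                  std::codecvt_mode(std::consume_header | std::little_endian)>;
    else
        cvt = new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>;
    std::locale loc(std::locale(), cvt);

    std::wifstream file;
    file.imbue(loc);  // imbue before open, as in the report
    file.open(szFilename, std::ios::in | std::ios::binary);
    std::istreambuf_iterator<wchar_t> eos;
    return std::wstring(std::istreambuf_iterator<wchar_t>(file), eos);
}
```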
If needed, I can provide a complete package with a test file. Thanks, Fred