llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

codecvt_utf16 facet with utf-16LE-bom file... #20855

Open llvmbot opened 10 years ago

llvmbot commented 10 years ago
Bugzilla Link 20481
Version 4.0
OS Linux
Reporter LLVM Bugzilla Contributor
CC @mclow,@tstellar,@jwakely

Extended Description

Hi, it seems the codecvt_utf16 facet is buggy.

I have a file that is UTF-16LE encoded with a BOM. I define two facets like this:

    std::locale utf16_locale(
        std::locale(),
        new std::codecvt_utf16< wchar_t, 0x10ffff, std::consume_header > );
    std::locale utf16_localeLEBOM(
        std::locale(),
        new std::codecvt_utf16< wchar_t, 0x10ffff,
            static_cast< std::codecvt_mode >( 5 ) /* std::little_endian | std::consume_header */ > );

I then define two streams like this:

    std::wifstream xmlfile1;
    std::wifstream xmlfile2;

And read them like this:

    {
        std::istreambuf_iterator< wchar_t > eos;
        xmlfile1.imbue( utf16_locale );
        xmlfile1.open( szFilename, std::ios::in | std::ios::binary );
        std::wstring strXML( std::istreambuf_iterator< wchar_t >( xmlfile1 ), eos );
    }
    {
        std::istreambuf_iterator< wchar_t > eos;
        xmlfile2.imbue( utf16_localeLEBOM );
        xmlfile2.open( szFilename, std::ios::in | std::ios::binary );
        std::wstring strXML( std::istreambuf_iterator< wchar_t >( xmlfile2 ), eos );
    }

With the first one, the BOM isn't recognized, and strXML ends up containing the raw content of the file, including the BOM. With the second one, I get the correct string, matching what a text editor shows.

If needed, I can provide a complete package with a test file. Thanks, Fred

mclow commented 7 years ago

I can reproduce this; it will almost certainly take a change in the dylib to fix.

3a17e7c4-fca4-4827-b2a7-11a54d73d746 commented 7 years ago

    #include <codecvt>
    #include <iostream>
    #include <locale>

    int main()
    {
      // Case 1: big-endian BOM, facet defaults to big-endian.
      {
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xabcd)
          std::cout << "Failed to process BOM from UTF-16BE string when"
            " expecting BE\nRead " << std::hex << (unsigned)dst[0] << '\n';
      }
      // Case 2: little-endian BOM, facet defaults to big-endian.
      {
        const char src[] = "\xFF\xFE\xAB\xCD\0";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xcdab)
          std::cout << "Failed to process BOM from UTF-16LE string when"
            " expecting BE\nRead " << std::hex << (unsigned)dst[0] << '\n';
      }
      // Case 3: big-endian BOM, facet defaults to little-endian.
      {
        const char src[] = "\xFE\xFF\xAB\xCD\0";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::codecvt_mode(std::consume_header|std::little_endian)>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xabcd)
          std::cout << "Failed to process BOM from UTF-16BE string when"
            " expecting LE\nRead " << std::hex << (unsigned)dst[0] << '\n';
      }
      // Case 4: little-endian BOM, facet defaults to little-endian.
      {
        const char src[] = "\xFF\xFE\xAB\xCD\0";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::codecvt_mode(std::consume_header|std::little_endian)>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xcdab)
          std::cout << "Failed to process BOM from UTF-16LE string when"
            " expecting LE\nRead " << std::hex << (unsigned)dst[0] << '\n';
      }
    }

This prints:

    Failed to process BOM from UTF-16LE string when expecting BE
    Read fffe
    Failed to process BOM from UTF-16BE string when expecting LE
    Read fffe

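That is, it fails "to determine the endianness of the subsequent multibyte sequence to be read". Instead, it unconditionally assumes that the endianness is the one given by the codecvt_mode template argument, so a BOM of the opposite byte order is returned as the first character (0xfffe) rather than consumed.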

3a17e7c4-fca4-4827-b2a7-11a54d73d746 commented 7 years ago

It seems that libc++ treats "consume_header" as "optionally discard a BOM that matches my current endianness" instead of "optionally read a BOM to determine the endianness".

i.e. you have to create the facet with the correct endianness in the first place, to be able to "consume" the BOM.

The standard says:

— If (Mode & consume_header), the facet shall consume an initial header sequence, if present, when reading a multibyte sequence to determine the endianness of the subsequent multibyte sequence to be read.

i.e. the facet must be able to switch endianness based on the BOM, not fix it at compile-time.
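To make the expected behaviour concrete, here is a rough sketch of a decoder that chooses the byte order at run time. It is not libc++ code and not a proposed patch: `decode_utf16_consume_header` and its `default_little_endian` parameter are hypothetical stand-ins for the facet's conversion step and its Mode argument, and the sketch works on raw 16-bit code units only (no surrogate or Maxcode checks).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the facet's conversion step; not libc++ code.
// Decodes raw UTF-16 bytes into code units. A leading BOM, if present, is
// consumed and overrides `default_little_endian`; otherwise the default applies.
std::u16string decode_utf16_consume_header(const std::string& bytes,
                                           bool default_little_endian) {
    bool little = default_little_endian;
    std::size_t pos = 0;

    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { little = false; pos = 2; }
        else if (b0 == 0xFF && b1 == 0xFE) { little = true; pos = 2; }
    }

    std::u16string out;
    // Combine each remaining byte pair; a trailing odd byte is ignored here.
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[little ? pos : pos + 1]);
        const unsigned hi = static_cast<unsigned char>(bytes[little ? pos + 1 : pos]);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Fed the four inputs from the test program above, this sketch yields 0xabcd for the big-endian BOM and 0xcdab for the little-endian BOM regardless of the default, which is the outcome that test program expects from the facet.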