llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.62k stars 11.83k forks source link

codecvt::in() incorrectly handles partial multibyte inputs #14131

Open llvmbot opened 12 years ago

llvmbot commented 12 years ago
Bugzilla Link 13759
Version unspecified
OS All
Attachments testcase
Reporter LLVM Bugzilla Contributor
CC @kariya-mitsuru,@mclow

Extended Description

When supplied with a partial multibyte input (i.e. a multibyte prefix), codecvt::in() returns codecvt_base::error instead of codecvt_base::partial.

Steps to reproduce:

  1. Compile the attached testcase with e.g. $ clang -std=c++11 -stdlib=libc++ -o test test.cc and run it.

Expected results: The testcase should print "partial". Actual results: The testcase prints "error".

Additional notes: I've reproduced this bug on both my Mac and Linux systems using the above testcase, which calls codecvt::in() on the first byte of the three-byte UTF-8 string "ℝ".

The C++11 standard (22.4.1.4.2) seems to imply that codecvt_base::partial should be returned, since "additional source elements are needed before another destination element can be produced". However, in the above testcase, libstdc++ 4.7 on my Linux machine returns codecvt_base::ok, so I'm not sure what the correct behavior is.

The code responsible (locale.cpp:1386) seems to have some inverted logic. In particular,

n = mbsnrtowcs(...); if (n == size_t(-1)) ... if (n == 0) return error; seems wrong, since mbsnrtowcs() returns -1 on error, not 0 (which simply indicates that the length of the wide output string is 0).

I'm fairly sure this is a bug, however I'm kind of poking around in the dark here, so please correct me if this is the intended behavior.

e48d0f12-b0d3-485a-93ce-75b823b0ccac commented 9 years ago

I think that do_in should behave as below.

==================================================================================== codecvt<wchar_t, char, mbstate_t>::result codecvt<wchar_t, char, mbstate_t>::do_in(state_type& st, const extern_type frm, const extern_type frm_end, const extern_type& frm_nxt, intern_type to, intern_type to_end, intern_type& to_nxt) const { frm_nxt = frm; to_nxt = to; if (frm == frm_end) { mbstate_t save_st = st; return mbrtowc_l(NULL, NULL, 0, &save_st, l) == 0 ? ok : partial; } do { if (to_nxt == to_end) return partial; size_t n = __mbrtowc_l(to_nxt, frm_nxt, frm_end - frm_nxt, &st, __l); switch (n) { case size_t(-1): return error; case size_t(-2): frm_nxt = frm_end; return partial; case 0: ++frm_nxt; break; default: frm_nxt += n; break; } ++to_nxt; } while (frm_nxt != frm_end); return ok; }

e48d0f12-b0d3-485a-93ce-75b823b0ccac commented 9 years ago

As pointed out above, I think that it should return partial, not ok.

cf. http://melpon.org/wandbox/permlink/8Jo9s9qODgywSmtP

This behavior causes llvm/llvm-bugzilla-archive#24929 .

Additionally, if "from" ends with an incomplete multibyte sequence followed by a null byte, "from_next" should point the first byte of an incomplete multibyte sequence, not a null byte.

C++98 (or later) standard 22.2.1.5.2 codecvt virtual functions [lib.locale.codecvt.virtuals] p.2 says, "It always leaves the from_next and to_next pointers pointing one beyond the last element successfully converted."

cf. http://melpon.org/wandbox/permlink/7xsTdIqqmE1X3Adg

mclow commented 9 years ago

committed revision 233012 to fix this problem.

I'm going to leave this bug open, though, until I figure what the correct behavior is. The fix merely deals with the error correctly.

llvmbot commented 12 years ago

assigned to @mclow