I'm still trying to get the hang of this Unicode/character-set thing. I tried
messing around with your string `vers_ف_ion`. If we just create a
`std::wstring` with the L"" syntax and output its characters as integers, I get
118 101 114 115 95 1601 95 105 111 110
On the other hand, reading it through the parser and then converting the
resulting `std::string` with the `std::mbstowcs` function (see
src/conversion.cpp), we get
118 101 114 115 95 217 129 95 105 111 110
Why are these different? Incidentally, those are exactly the bytes already in
the `std::string` before any conversion. Why doesn't `mbstowcs` do anything
here?
Original comment by jbe...@gmail.com
on 17 Jul 2009 at 2:24
mbstowcs uses the current locale (and thus probably a Latin-1 / CP 1252
encoding) instead of a UTF-8 encoding. 217 129 (hex: D9 81) is the UTF-8
encoding of 1601 (U+0641), but interpreted as Latin-1 those two bytes just
become U+00D9 U+0081.
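To make the arithmetic concrete, here is a minimal sketch of the two interpretations of the bytes D9 81. The helper names `decode_utf8_two_bytes` and `decode_latin1` are made up for illustration and are not part of yaml-cpp; the first assumes the caller already knows the input is a well-formed two-byte sequence.

```cpp
#include <cstdint>
#include <vector>

// Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into one code point.
// Illustrative helper: no validation is done on the input bytes.
std::uint32_t decode_utf8_two_bytes(unsigned char lead, unsigned char cont) {
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}

// Interpret bytes as Latin-1: each byte maps to the code point of the same value.
std::vector<std::uint32_t> decode_latin1(const std::vector<unsigned char>& bytes) {
    return std::vector<std::uint32_t>(bytes.begin(), bytes.end());
}
```

Read as UTF-8, the pair yields the single value 1601; read as Latin-1, it stays as the two values 217 and 129, which matches the two outputs in the original report.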
Original comment by mur...@gmail.com
on 17 Jul 2009 at 6:17
So what should we do? On the MSDN pages, I found _mbstowcs_l, which takes a
specified locale, but this doesn't appear to be portable (it's not on OS X).
How can we specify that the multi-byte string is in UTF-8?
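One portable option, sketched here under assumptions, is to skip the locale machinery entirely and decode the UTF-8 bytes by hand. The function name `utf8_to_code_points` is hypothetical, and the choice of `std::u32string` as the output type (which requires C++11) is an assumption rather than anything yaml-cpp does.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode a UTF-8 std::string into code points without consulting the locale.
// Sketch only: truncated or malformed sequences are rejected, but overlong
// encodings and surrogate code points are not checked.
std::u32string utf8_to_code_points(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(static_cast<char32_t>(cp));
        i += extra + 1;
    }
    return out;
}
```

Because the result is a sequence of code points rather than `wchar_t` values, it does not depend on the platform's `wchar_t` width.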
Original comment by jbe...@gmail.com
on 17 Jul 2009 at 3:32
I'll try to do a bit of research and tests.
I am not sure it's a matter of locale.
Maybe the input could be read directly into a wstring instead of converting a
string (that contains UTF-8 chars) to a wstring.
Original comment by bobbyc...@gmail.com
on 17 Jul 2009 at 6:12
Maybe you could just handle the 2-byte UTF-8 chars, as wchar_t has a size of 2
bytes (on MS Windows at least).
Anyway, the list would not be so bad: look at the character table in Windows
(if you have a Windows machine available); it just displays the whole table
from U+0000 to U+FFFF.
Original comment by bobbyc...@gmail.com
on 17 Jul 2009 at 8:38
I know the size of a wchar_t is platform-dependent, and I'd like to have a
cross-platform solution here. In any case, the parser works on a stream of
chars - this is consistent with everything else (including the YAML spec), and
Richard's UTF streaming patch, so we won't have it reading a stream of
wchar_t's.
Original comment by jbe...@gmail.com
on 20 Jul 2009 at 8:28
I'd like to share a little information on character encodings involving
Unicode. If you'd like it with a little more humor, you can get some of it at
http://www.joelonsoftware.com/articles/Unicode.html, where Joel Spolsky does a
wonderful job of giving background.
The idea of Unicode is that each character is assigned a particular number
called a code point. This code point has a value in the range 0-0x10FFFF.
Because the range of the code point values is so large, a number of ways have
been developed to encode these values into sequences of bytes (known as UTFs or
Unicode Transformation Formats).
UTF-8 is very popular in cross-platform circles. Every Unicode code point maps
to a sequence of 1-4 bytes in a UTF-8 encoded string. Code points in the range
0-0x7F (the ASCII set) use their native value in one byte. Code points in the
ranges 0x80-0x7FF, 0x800-0xFFFF (minus 0xD800-0xDFFF, which is reserved for
UTF-16 surrogates), and 0x10000-0x10FFFF use two-byte, three-byte, and
four-byte sequences respectively.
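The ranges above translate fairly directly into code. The following is a minimal sketch of a UTF-8 encoder for a single code point; the function name `encode_utf8` is hypothetical and this is not how yaml-cpp itself is implemented.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence (1 to 4 bytes).
// Sketch only: surrogates and values above 0x10FFFF are rejected.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // ASCII: the byte is the code point itself.
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
    return out;
}
```

For the character from the original report, `encode_utf8(0x641)` produces the two bytes D9 81 (decimal 217 129).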
C and C++ compilers and text editors are not necessarily very smart about string
constants. I usually only trust them with correctly interpreting ASCII
characters and escape sequences. The most trustworthy way I have found to encode
Unicode string literals that contain non-ASCII characters is to work out the
UTF-8 encoding for the code point and code the byte sequence as a sequence of
octal escapes (hex escapes are risky because they absorb any hex digits that
follow, and C has no decimal escapes). Because UTF-8's basic encoding level is
the byte (or octet, to be more technical than necessary), these literals are
expressed using the good old double quote (no leading L), and helpfully UTF-8
still defines the code point 0 as the single byte sequence [0].
Thus, I would approach the problem originally given with the code:
{{{
std::string str;
node["vers_\0331\0201_ion"] >> str;
}}}
and expect str to receive the string encoded in UTF-8.
The other popular method (most popular on Windows) is variously known as UCS-2
or UTF-16. There are slight differences between the two, mostly regarding the
ability to represent code points in the range 0x10000-0x10FFFF (UTF-16 has it
and UCS-2 doesn't, at least officially). This method is not widely supported on
Linux or UNIX based systems. Each code point is represented by a single 16-bit
value (or, in UTF-16, by a pair of values from the range 0xD800-0xDFFF for the
code point range 0x10000-0x10FFFF). The 16-bit values can be either big-endian
or little-endian depending on the platform and/or usage. On platforms where
wchar_t is 16 bits wide (such as Windows), string literals of this type have the
element type wchar_t and are expressed in code using the L"" syntax.
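As a rough illustration of the single-value versus surrogate-pair split, here is a minimal sketch of a UTF-16 encoder for one code point. The function name `encode_utf16` is hypothetical, and byte order is deliberately left out: the result is a list of 16-bit values, not bytes.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Encode one Unicode code point as one or two UTF-16 code units.
// Sketch only: endianness is not handled here.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not an encodable code point");
    if (cp < 0x10000) {
        // Fits in a single 16-bit value.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Split the 20 bits above 0x10000 into a high and a low surrogate.
    cp -= 0x10000;
    return { static_cast<std::uint16_t>(0xD800 | (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)) };
}
```

U+0641 comes out as the single value 0x0641, while a code point such as U+1F600 needs the pair 0xD83D 0xDE00.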
Please understand that both UTF-8 and UTF-16 can encode any possible sequence of
valid Unicode code points; in terms of expressive power it does not matter which
one you choose. Internally, yaml-cpp treats strings as UTF-8, which makes it
significantly easier to work with UTF-8 on the external interface. It would also
be quite possible to add UTF-16 (BE or LE depending on the system) support to
the API, since UTF-8 and UTF-16 are essentially just different ways of
expressing the same sequence of Unicode code points.
Original comment by rtweeks21
on 29 Jul 2009 at 6:18
OK, I had the syntax for the C string literal octal escape sequence wrong.
Here's the corrected code:
{{{
std::string str;
node["vers_\331\201_ion"] >> str;
}}}
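For anyone wondering what differs between the two literals: an octal escape takes at most three octal digits, so `\0331` is read as `\033` followed by the character `1`. The sketch below dumps the bytes of both spellings; the names `byte_values`, `corrected`, and `earlier` are illustrative only.

```cpp
#include <string>
#include <vector>

// Return the bytes of a string as non-negative integers, for inspection.
std::vector<int> byte_values(const std::string& s) {
    std::vector<int> out;
    for (unsigned char c : s) out.push_back(c);
    return out;
}

// "\331\201" is the two bytes D9 81, the UTF-8 encoding of U+0641.
const std::string corrected = "vers_\331\201_ion";

// "\0331\0201" is four bytes: octal 033, the digit '1', octal 020, the digit '1'.
const std::string earlier = "vers_\0331\0201_ion";
```

The corrected literal holds 11 bytes with 217 129 in the middle, while the earlier one holds 13 bytes with 27 49 16 49 in the middle.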
Original comment by rtweeks21
on 4 Aug 2009 at 6:46
Thanks, Richard, for the feedback and info. If I understand correctly, a
possible decision is to remove the `std::wstring` processing (which doesn't
really work as I expected, anyways) and simply rely on `std::string`, using the
UTF-8 encoding.
Assuming this is OK (Richard, can you weigh in?), I'll leave it as is until
someone gets fed up and submits a UTF-16 patch (or something like that).
Original comment by jbe...@gmail.com
on 11 Aug 2009 at 10:56
I've created a Google Gadget at
http://hosting.gmodules.com/ig/gadgets/file/111180078345548400783/c-style-utf8-encoder.xml
that takes a string of Unicode characters (paste them into the browser) and
creates a C string literal. It also escapes quotation marks, but nothing else.
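The gadget's own source is not shown here, so the following is only a guess at the described behavior, written in C++: non-ASCII bytes of UTF-8 input become three-digit octal escapes, double quotes get a backslash, and everything else (including backslashes) passes through unchanged. The function name `to_c_literal` is hypothetical.

```cpp
#include <cstdio>
#include <string>

// Turn UTF-8 bytes into the text of a C string literal.
// Assumption: mirrors the description above, so only non-ASCII bytes and
// double quotes are escaped; backslashes and control characters are not.
std::string to_c_literal(const std::string& utf8) {
    std::string out = "\"";
    for (unsigned char c : utf8) {
        if (c >= 0x80) {
            char buf[5];  // backslash + three octal digits + terminator
            std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
            out += buf;
        } else if (c == '"') {
            out += "\\\"";
        } else {
            out += static_cast<char>(c);
        }
    }
    out += '"';
    return out;
}
```

Always emitting three octal digits keeps the escape unambiguous even when the next character in the string is itself a digit.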
It should be possible to reference the gadget from a wiki page if we create a
page on "Strings in yaml-cpp". The syntax is
{{{
<wiki:gadget url="http://hosting.gmodules.com/ig/gadgets/file/111180078345548400783/c-style-utf8-encoder.xml">
}}}
See http://code.google.com/p/support/wiki/WikiSyntax for more details.
Original comment by rtweeks21
on 21 Aug 2009 at 8:23
Thanks, Richard! I started the page
http://code.google.com/p/yaml-cpp/wiki/Strings.
If you have any other suggestions, let me know. I'm also going to close this
issue, since I think it's resolved for the time being.
Original comment by jbe...@gmail.com
on 22 Aug 2009 at 12:29
Original issue reported on code.google.com by
jbe...@gmail.com
on 17 Jul 2009 at 2:18