simonlynen / yaml-cpp

Automatically exported from code.google.com/p/yaml-cpp
MIT License

Support for accessing keys with std::wstring #22

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
From bobbycoul:

You convert string to wstring, so Unicode strings are not read correctly from the YAML file.
Say you have this:

{{{
--- # yaml file
vers_ف_ion: 1.0 xx_ف_xx
}}}

You see what I mean? It would be nice to be able to get it with something like:

{{{
wstring str;
node[L"vers_ف_ion"] >> str;
}}}
with the cpp file encoded as Unicode or UTF-8.

Do you get my point? In this case everything could be in UTF-8: the keys and also
any text in the file.

Original issue reported on code.google.com by jbe...@gmail.com on 17 Jul 2009 at 2:18

GoogleCodeExporter commented 9 years ago
I'm still trying to get the hang of this unicode/other character set thing. I 
tried messing around with your 
string `vers_ف_ion`. If we just create a `std::wstring` with the L"" syntax 
and output its characters as 
integers, I get

118 101 114 115 95 1601 95 105 111 110

On the other hand, reading it through the parser and then interpreting the 
`std::string` using the 
`std::mbstowcs` function (see src/conversion.cpp), we get

118 101 114 115 95 217 129 95 105 111 110

Why are these different? Incidentally, the above characters are also what are 
just in the `std::string` with no 
conversion. Why doesn't `mbstowcs` do anything here?

Original comment by jbe...@gmail.com on 17 Jul 2009 at 2:24

GoogleCodeExporter commented 9 years ago
mbstowcs converts according to the current locale (and thus probably a Latin-1 / CP 1252
encoding) instead of a UTF-8 encoding. The bytes 217 129 (hex: D9 81) are the UTF-8 encoding
of 1601 (U+0641), but interpreted as Latin-1 they simply become U+00D9 U+0081.
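
To make the difference concrete, here is a minimal sketch of the two interpretations. The function names `widen_as_latin1` and `decode_utf8` are made up for illustration and are not part of yaml-cpp; the decoder assumes well-formed input and does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: widen each byte on its own, which is what a
// single-byte (Latin-1 style) interpretation amounts to.
std::vector<std::uint32_t> widen_as_latin1(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (unsigned char b : bytes) out.push_back(b);
    return out;
}

// Hypothetical helper: decode the bytes as UTF-8 into code points.
// Sketch only: assumes well-formed input, does not reject overlong
// forms or surrogates.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Applied to the bytes D9 81, `widen_as_latin1` yields 217 129 while `decode_utf8` yields the single value 1601, which matches the two outputs quoted earlier in the thread.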

Original comment by mur...@gmail.com on 17 Jul 2009 at 6:17

GoogleCodeExporter commented 9 years ago
So what should we do? On the MSDN pages, I found _mbstowcs_l, which takes a
specified locale, but this doesn't
appear to be portable (it's not on OS X). How can we specify that the
multi-byte string is in UTF-8?

Original comment by jbe...@gmail.com on 17 Jul 2009 at 3:32

GoogleCodeExporter commented 9 years ago
I'll try to do a bit of research and tests.
I am not sure if it's a matter of locale.
Maybe it should be read directly into a wstring instead of converting a string
(that contains UTF-8 chars) to wstring.

Original comment by bobbyc...@gmail.com on 17 Jul 2009 at 6:12

GoogleCodeExporter commented 9 years ago
Maybe you could just handle the 2-byte UTF-8 chars, as wchar_t has a size of 2 bytes
(on MS Windows at least).
Anyway, the list would not be so bad: look at the Windows character table (if you
have a Windows machine available), which just displays the whole table from U+0000 to
U+FFFF.

Original comment by bobbyc...@gmail.com on 17 Jul 2009 at 8:38

GoogleCodeExporter commented 9 years ago
I know the size of a wchar_t is platform-dependent, and I'd like to have a
cross-platform solution here. In any case, the parser works on a stream of chars;
this is consistent with everything else (including the YAML spec), and Richard's UTF
streaming patch, so we won't have it reading a stream of wchar_t's.

Original comment by jbe...@gmail.com on 20 Jul 2009 at 8:28

GoogleCodeExporter commented 9 years ago
I'd like to share a little information on character encodings involving 
Unicode.  If
you'd like it with a little more humor, you can get some of it at
http://www.joelonsoftware.com/articles/Unicode.html, where Joel Spolsky does a
wonderful job of giving background.

The idea of Unicode is that each character is assigned a particular number 
called a
code point.  This code point has a value in the range 0-0x10FFFF.  Because the 
range
of the code point values is so large, a number of ways have been developed to 
encode
these values into sequences of bytes (known as UTFs or Unicode Transformation 
Formats).

UTF-8 is very popular in cross-platform circles.  Every Unicode code point maps 
to a
sequence of 1-4 bytes in a UTF-8 encoded string.  Code points in the range 
0-0x7F
(the ASCII set) use their native value in one byte.  Code points in the ranges
0x80-0x7FF, 0x800-0xFFFF (minus 0xD800-0xDFFF, which is reserved), and
0x10000-0x10FFFF use two byte, three byte, and four byte sequences respectively.
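
The ranges above map directly onto code. The following is a minimal sketch of a UTF-8 encoder; the name `encode_utf8` is chosen here for illustration and is not a yaml-cpp function.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as 1-4 UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    // Reject values outside 0-0x10FFFF and the reserved range 0xD800-0xDFFF.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not an encodable Unicode code point");
    }
    std::string out;
    if (cp <= 0x7F) {            // ASCII: one byte, native value
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {    // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {   // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                     // four bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For the character in the original report, `encode_utf8(0x0641)` produces the two bytes D9 81 (decimal 217 129).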

C and C++ compilers and text editors are not necessarily very smart about string
constants.  I usually only trust them with correctly interpreting ASCII 
characters
and escape sequences.  The most trustworthy way I have found to encode Unicode 
string
literals that contain non-ASCII characters is to work out the UTF-8 encoding 
for the
code point and code the byte sequence as a sequence of octal escapes (C has no
decimal escapes, and a hex escape has no length limit, so it swallows any hex digits
that follow it).  Because UTF-8's basic encoding 
level
is the byte (or octet to be more technical than necessary), these literals are
expressed just using the good-old double quote (no leading L), and helpfully 
UTF-8
still defines the code point 0 as the single byte sequence [0].

Thus, I would approach the problem originally given with the code:
{{{
std::string str;
node["vers_\0331\0201_ion"] >> str;
}}}
and expect str to receive the string encoded in UTF-8.

The other popular method (most popular on Windows) is variously known as UCS-2 
or
UTF-16.  There are slight differences between the two, mostly regarding the 
ability
to represent code points in the range 0x10000-0x10FFFF (UTF-16 has it and UCS-2
doesn't, at least officially).  This method is not highly supported on Linux or 
UNIX
based systems.  Each code point is represented by a single 16-bit value (or a 
pair of
values from the range 0xD800-0xDFFF in UTF-16 for the code point range
0x10000-0x10FFFF).  The 16-bit values can either be big-endian or little-endian
depending on the platform and/or usage.  On Windows, string literals of this type have the
element type wchar_t and are expressed in code using the L"" syntax (elsewhere wchar_t is often 32 bits wide).
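
As a rough illustration of the surrogate-pair rule, here is a sketch of a UTF-16 encoder. It assumes C++11 (`char16_t`, `std::u16string`), and the name `encode_utf16` is invented for this example. It produces 16-bit values only and leaves byte order (BE or LE) to whoever serializes them.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as one or two
// 16-bit UTF-16 code units.
std::u16string encode_utf16(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not an encodable Unicode code point");
    }
    std::u16string out;
    if (cp < 0x10000) {
        // A single 16-bit value; this is also all that UCS-2 can represent.
        out += static_cast<char16_t>(cp);
    } else {
        // Split the remaining 20 bits across a high and a low surrogate.
        std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

With this sketch, U+0641 stays a single unit 0x0641, while U+1F600 becomes the pair 0xD83D 0xDE00.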

Please understand that UTF-8 or UTF-16 can encode any possible sequence of valid
Unicode code points; in terms of expressive power it does not matter which one 
you
choose.  Internally, yaml-cpp treats strings as UTF-8 which makes it 
significantly
easier to work with UTF-8 on the external interface.  It would also be quite 
possible
to add UTF-16 (BE or LE depending on the system) support to the API since UTF-8 
and
UTF-16 are essentially just different ways of expressing the same sequence of 
Unicode
code points.

Original comment by rtweeks21 on 29 Jul 2009 at 6:18

GoogleCodeExporter commented 9 years ago
OK, I had the syntax for the C string literal octal escape sequence wrong.  
Here's
the corrected code:
{{{
std::string str;
node["vers_\331\201_ion"] >> str;
}}}
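
For anyone wondering why the first attempt was wrong: an octal escape consumes at most three octal digits, so `\0331` is the byte 033 followed by the character `1`. The sketch below dumps the byte values of both literals; `byte_values` is a throwaway helper written for this comparison, not part of yaml-cpp.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: list the bytes of a string as unsigned integers.
std::vector<int> byte_values(const std::string& s) {
    std::vector<int> out;
    for (unsigned char b : s) out.push_back(b);
    return out;
}

// First attempt: each escape stops after three octal digits, so this is
// four bytes: 033 (ESC), '1', 020, '1'.
const std::string first_attempt = "\0331\0201";

// Corrected form: two bytes, 0331 (0xD9) and 0201 (0x81).
const std::string corrected = "\331\201";
```

`byte_values(first_attempt)` gives 27 49 16 49, whereas `byte_values(corrected)` gives 217 129, the UTF-8 encoding of U+0641.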

Original comment by rtweeks21 on 4 Aug 2009 at 6:46

GoogleCodeExporter commented 9 years ago
Thanks, Richard, for the feedback and info. If I understand correctly, a 
possible
decision is to remove the `std::wstring` processing (which doesn't really work 
as I
expected, anyways) and simply rely on `std::string`, using the UTF-8 encoding.

Assuming this is OK (Richard, can you weigh in?), I'll leave it as is until 
someone
gets fed up and submits a UTF-16 patch (or something like that).

Original comment by jbe...@gmail.com on 11 Aug 2009 at 10:56

GoogleCodeExporter commented 9 years ago
I've created a Google Gadget at
http://hosting.gmodules.com/ig/gadgets/file/111180078345548400783/c-style-utf8-encoder.xml
that takes a string of Unicode characters (paste them into the browser) and 
creates a
C string literal.  It also escapes quotation marks, but nothing else.
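
The same idea can be sketched in C++ for anyone who prefers a local tool. This is not the gadget's code; `to_c_literal` is a hypothetical function, it assumes its input is already UTF-8, and unlike the gadget it also escapes backslashes. Control characters are passed through unchanged.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: turn a UTF-8 string into a C string literal whose
// non-ASCII bytes are written as three-digit octal escapes.
std::string to_c_literal(const std::string& utf8) {
    std::string out = "\"";
    for (unsigned char b : utf8) {
        if (b == '"' || b == '\\') {
            out += '\\';                  // escape quotes and backslashes
            out += static_cast<char>(b);
        } else if (b >= 0x80) {
            char buf[5];                  // backslash + 3 digits + NUL
            std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
            out += buf;
        } else {
            out += static_cast<char>(b);  // plain ASCII is copied as-is
        }
    }
    out += '"';
    return out;
}
```

Bytes at or above 0x80 always need exactly three octal digits, so a digit that follows an escape in the output cannot be absorbed into it.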

It should be possible to reference the gadget from a wiki page if we create a 
page on
"Strings in yaml-cpp".  The syntax is
{{{
<wiki:gadget
url="http://hosting.gmodules.com/ig/gadgets/file/111180078345548400783/c-style-utf8-encoder.xml">
}}}
See http://code.google.com/p/support/wiki/WikiSyntax for more details.

Original comment by rtweeks21 on 21 Aug 2009 at 8:23

GoogleCodeExporter commented 9 years ago
Thanks, Richard! I started the page 
http://code.google.com/p/yaml-cpp/wiki/Strings.
If you have any other suggestions, let me know. I'm also going to close this 
issue,
since I think it's resolved for the time being.

Original comment by jbe...@gmail.com on 22 Aug 2009 at 12:29