No support for character encodings other than UTF-8

GoogleCodeExporter commented 9 years ago

The current implementation of the library assumes that the stream
(std::istream) provided to the parser will produce a stream of UTF-8 octets
(with no leading byte order mark).  This means that yaml-cpp does not
currently support reading files in UTF-16 or UTF-32 (either little- or
big-endian, with or without byte order mark), nor does it support reading
UTF-8 files with a leading byte order mark.

Attached is a patch to allow streams of octets in any of the YAML supported
character encodings, based on revision 177 of the trunk.

I also, incidentally, patched the fetch from the stream so that it fetches
a chunk (of up to YAML_PREFETCH_SIZE octets) at a time rather than seeking
on the stream (which usually fails for streams like std::cin) to determine
the stream's length.

Original issue reported on code.google.com by rtweeks21 on 3 Jul 2009 at 5:18

Attachments:

all-utfs.patch

GoogleCodeExporter commented 9 years ago

Sweet! This looks pretty solid - thanks for the great work!

I applied the patch in a new branch 'utf'. I found a small bug when I tested it 
in OS X (it worked in Windows 
with no problems that I found) - when reading a Windows newline CR/LF, we 
already have a mechanism for 
figuring out how many characters that newline takes up (the RegEx::Match 
returns the number of characters 
matched, so we use that).

I realized that you added some extra testing in the Stream::get function to see 
if it's reading a CR/LF newline 
- but it's actually designed to read a CR, and then a LF. See 
Scanner::ScanToNextToken for exactly where this 
is used. Anyways, I just switched it back to returning one character at a time 
and it works fine.

Also, it's slower than the original version. I added a little parse program 
that just parses a document (see 
util/parse) for testing, and I'm using the Duck.yaml file that someone sent in 
Issue 7 (it's 328 KB). Compiling 
in gcc with -O2, on my machine the trunk takes about .14s and the utf branch 
takes about .38s. Granted, this 
is a pretty big file and the times are pretty small anyways, but it's still 
about 2.5x slower. I haven't profiled the 
new version yet, but I will, and I'll see what I can do. If you want to play 
with that also, that'd be cool.

Thanks again, this is a much needed addition to the library!

Original comment by jbe...@gmail.com on 7 Jul 2009 at 9:59

Changed state: Started

GoogleCodeExporter commented 9 years ago

I inlined a lot of the code (see the svn log for more details), which included 
a substantial overhaul to the RegEx 
class, but this sped up the utf branch by about 2x. I'll still work more on it, 
but it's at the point where I'm happy 
(enough) with its speed to merge soon.

Original comment by jbe...@gmail.com on 9 Jul 2009 at 7:14

GoogleCodeExporter commented 9 years ago

I played with it a little more to squeeze a little more speed out of it, and I 
decided to merge it with the trunk. 
FWIW, my speed tests show that it's somewhere between 5% slower (on my Linux 
box) and 20% slower (on my 
Mac). (Of course, these tests are only for simple ASCII files.)

Thanks again! By the way, if someone opens an issue relating to this aspect of 
the code, would you be willing to 
help fix it? I've read through all of your patch, but of course you know how it 
works better than I do :)

Original comment by jbe...@gmail.com on 10 Jul 2009 at 3:39

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Aboslutely.  You've got my email address, so just pop me a line.  I even check 
the
issues list here from time to time, so I might even see it and start working on 
it. 
I don't have any kind of automated notification system though, so it wouldn't 
hurt to
forward any issue on to me in case I've forgotten to check.

Original comment by rtweeks21 on 10 Jul 2009 at 1:45

majintao0131 / yaml-cpp

No support for character encodings other than UTF-8 #20