readium / readium-sdk

A C++ ePub renderer SDK
BSD 3-Clause "New" or "Revised" License

Core is not handling file encodings other than UTF-8 correctly #73

Open rkwright opened 10 years ago

rkwright commented 10 years ago

(Transferred from Bugzilla, #17, 24 July 2014)

Daniel Weck 2014-07-13 15:42:41 CDT

Probably related: an issue reported against Adobe's Digital Editions v4 beta with Japanese HTML content. The WebView on OS X just displays the source code and does not render the markup as expected. The same content works okay in ReadiumJS.

Description Ric Wright 2013-11-22 10:54:11 CST

Reported by Takeshi Kanai of Sony

The files I used contain TOC files which are encoded in UTF-8 with BOM. The problem was solved when I changed the code to let the XML parser detect the encoding, but I'm not sure how the change affects performance. Could you please look into it?

[target] ManifestItem::ReferencedDocument() in /epub3/epub/Components/manifest.cpp

Before)
    if ( _mediaType == "text/html" )
        result = reader->htmlReadDocument(path.c_str(), "utf-8", flags);
    else
        result = reader->xmlReadDocument(path.c_str(), "utf-8", flags);

After)
    if ( _mediaType == "text/html" )
        result = reader->htmlReadDocument(path.c_str(), nullptr, flags);
    else
        result = reader->xmlReadDocument(path.c_str(), nullptr, flags);

After that, out of curiosity, I made some test files to verify how the SDK handles UTF-8, UTF-8 with BOM, and UTF-16 encoded files. The test files are in this repository: https://github.com/tkanai/epub-testfiles/tree/master/encoding-check/epubfiles

Each file name consists of three identifiers:

test-fileencoding-[A]-[B]-[C]

[A] is the encoding of container.xml
[B] is the encoding of the .opf file
[C] is the encoding of the navigation document

"8" is UTF-8, "8b" is UTF-8 with BOM, and "16" is UTF-16. Each epub contains three normal pages and a TOC. The normal pages are encoded in the encodings listed above.

It seems that the SDK handles the container.xml encoding correctly, but it cannot render UTF-16 pages at all. And when the .opf file contains a BOM (that is, when it is UTF-8 with BOM or UTF-16), it always fails.
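
The naming scheme of the test files described above can be decoded mechanically, which may help when tabulating results per encoding combination. The following is a minimal C++ sketch. `TestEncoding`, `TestFileEncodings`, `parseToken`, and `parseTestFileName` are hypothetical names that are not part of readium-sdk, and the handling of an optional `.epub` suffix is an assumption about how the files in the repository are named.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not part of readium-sdk.
enum class TestEncoding { Utf8, Utf8Bom, Utf16 };

struct TestFileEncodings {
    TestEncoding container;  // [A] container.xml
    TestEncoding opf;        // [B] .opf file
    TestEncoding nav;        // [C] navigation document
};

// Maps one identifier ("8", "8b", "16") to an encoding.
// Returns false for any other token.
bool parseToken(const std::string& token, TestEncoding& out) {
    if (token == "8")  { out = TestEncoding::Utf8;    return true; }
    if (token == "8b") { out = TestEncoding::Utf8Bom; return true; }
    if (token == "16") { out = TestEncoding::Utf16;   return true; }
    return false;
}

// Parses "test-fileencoding-[A]-[B]-[C]", with an optional ".epub"
// suffix (assumed). Returns false if the name does not match.
bool parseTestFileName(const std::string& name, TestFileEncodings& out) {
    const std::string prefix = "test-fileencoding-";
    if (name.compare(0, prefix.size(), prefix) != 0) return false;

    std::string rest = name.substr(prefix.size());
    const std::string ext = ".epub";
    if (rest.size() >= ext.size() &&
        rest.compare(rest.size() - ext.size(), ext.size(), ext) == 0) {
        rest.erase(rest.size() - ext.size());
    }

    // Split the remainder on '-' and translate each identifier.
    std::vector<TestEncoding> parts;
    std::istringstream in(rest);
    std::string token;
    while (std::getline(in, token, '-')) {
        TestEncoding enc;
        if (!parseToken(token, enc)) return false;
        parts.push_back(enc);
    }
    if (parts.size() != 3) return false;

    out.container = parts[0];
    out.opf = parts[1];
    out.nav = parts[2];
    return true;
}
```

For example, `test-fileencoding-8-8b-16` would parse as a UTF-8 container.xml, a UTF-8 with BOM .opf file, and a UTF-16 navigation document, which is one of the combinations reported above as failing because the .opf file carries a BOM.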

bdares commented 10 years ago

I've observed similar behavior on Windows: the string being passed to the XML parser doesn't have the BOM stripped out. I'm not sure whether this means the file contents are being read correctly (what happens if the file was written big-endian?).
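
On the BOM question: a byte order mark identifies the encoding family and, for UTF-16, the byte order, so a reader can inspect the first bytes of a buffer before handing it to the parser. The following is a minimal C++ sketch of that check. `BomKind`, `BomInfo`, `detectBom`, and `stripBom` are hypothetical names, and this is not how readium-sdk or its XML parser actually handles the BOM. UTF-32 BOMs are deliberately not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only; not readium-sdk API.
enum class BomKind { None, Utf8, Utf16LittleEndian, Utf16BigEndian };

struct BomInfo {
    BomKind kind;
    std::size_t length;  // number of BOM bytes at the start of the buffer
};

// Inspects the leading bytes of a raw buffer for a UTF-8 or UTF-16 BOM.
// Note: a UTF-32 little-endian BOM (FF FE 00 00) also starts with FF FE
// and would be reported as UTF-16 little-endian by this sketch.
BomInfo detectBom(const std::string& bytes) {
    const auto byteAt = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 &&
        byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF) {
        return {BomKind::Utf8, 3};
    }
    if (bytes.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE) {
        return {BomKind::Utf16LittleEndian, 2};
    }
    if (bytes.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF) {
        return {BomKind::Utf16BigEndian, 2};
    }
    return {BomKind::None, 0};
}

// Returns a copy of the buffer without its BOM. The bytes are not
// transcoded: UTF-16 content stays UTF-16 after stripping.
std::string stripBom(const std::string& bytes) {
    return bytes.substr(detectBom(bytes).length);
}
```

Stripping alone only suffices for UTF-8 input. A UTF-16 buffer would still need to be transcoded, or the detected encoding passed on to the parser, before a UTF-8-only code path could use it.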