Core is not handling file encodings other than UTF-8 correctly

(Transferred from Bugzilla, #17, 24 July 2014)

Daniel Weck 2014-07-13 15:42:41 CDT

A probably related issue was reported against the Adobe Digital Editions v4 beta with Japanese HTML content: the WebView on OS X just displays the source code instead of rendering the markup as expected. The same content works okay in ReadiumJS.

Description Ric Wright 2013-11-22 10:54:11 CST

Reported by Takeshi Kanai of Sony

The files I used contain TOC files that are encoded in UTF-8 with a BOM. The problem was solved when I changed the code to let the XML parser detect the encoding, but I'm not sure how the change affects performance. Could you please look into it?

[target] ManifestItem::ReferencedDocument() in /epub3/epub/Components/manifest.cpp

Before)
    if ( _mediaType == "text/html" )
        result = reader->htmlReadDocument(path.c_str(), "utf-8", flags);
    else
        result = reader->xmlReadDocument(path.c_str(), "utf-8", flags);

After)
    if ( _mediaType == "text/html" )
        result = reader->htmlReadDocument(path.c_str(), nullptr, flags);
    else
        result = reader->xmlReadDocument(path.c_str(), nullptr, flags);
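Passing nullptr instead of "utf-8" presumably lets the underlying parser sniff the encoding from the first bytes of the file. As a rough illustration of what that detection involves, here is a minimal sketch of BOM sniffing in plain C++. The names BomEncoding and DetectBom are hypothetical and are not part of the Readium SDK or of the parser library. A real parser also inspects the XML declaration, which this sketch does not do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper types for illustration only; not part of the SDK.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE };

// Look at the first bytes of a buffer and report which byte order mark,
// if any, it starts with. UTF-32 BOMs are deliberately not handled here,
// so a UTF-32LE buffer (FF FE 00 00) would be reported as UTF-16LE.
BomEncoding DetectBom(const std::string& bytes)
{
    const auto byteAt = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    // UTF-8 BOM: EF BB BF
    if (bytes.size() >= 3 &&
        byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF)
        return BomEncoding::Utf8;

    // UTF-16 little-endian BOM: FF FE
    if (bytes.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE)
        return BomEncoding::Utf16LE;

    // UTF-16 big-endian BOM: FE FF
    if (bytes.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF)
        return BomEncoding::Utf16BE;

    return BomEncoding::None;
}
```

The check reads at most three bytes, so on its own it should add very little cost per document. Whether the parser's full autodetection is measurably slower than a forced "utf-8" would still need to be profiled.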

After that, out of curiosity, I made some test files to verify how the SDK handles UTF-8, UTF-8 with BOM, and UTF-16 encoded files. The test files are in this repository: https://github.com/tkanai/epub-testfiles/tree/master/encoding-check/epubfiles

Each file name consists of three identifiers: test-fileencoding-[A]-[B]-[C]

[A] is the encoding of container.xml
[B] is the encoding of the .opf file
[C] is the encoding of the navigation document

"8" is UTF-8, "8b" is UTF-8 with BOM, and "16" is UTF-16. Each EPUB contains three normal pages and a TOC. The normal pages are encoded in the encodings listed above.

It seems that the SDK handles the container.xml encoding correctly, but it cannot render UTF-16 pages at all. And when the .opf file starts with a BOM (that is, UTF-8 with BOM or UTF-16), it always fails.
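The test-fileencoding-[A]-[B]-[C] naming scheme described above can be decoded mechanically, which may help when tabulating which combinations fail. The following is a minimal sketch. ParseTestFileName, EncodingFromToken, and TestFileEncodings are hypothetical names, and the ".epub" extension on the file names is an assumption.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Map one identifier from the file name to a readable encoding label.
// Returns an empty string for an unknown identifier.
std::string EncodingFromToken(const std::string& token)
{
    if (token == "8")  return "UTF-8";
    if (token == "8b") return "UTF-8 with BOM";
    if (token == "16") return "UTF-16";
    return "";
}

// Hypothetical result type: one label per file described by the name.
struct TestFileEncodings {
    std::string container;  // [A] container.xml
    std::string opf;        // [B] .opf file
    std::string nav;        // [C] navigation document
    bool valid = false;     // true only if all three identifiers were recognized
};

TestFileEncodings ParseTestFileName(const std::string& name)
{
    TestFileEncodings result;
    const std::string prefix = "test-fileencoding-";
    if (name.compare(0, prefix.size(), prefix) != 0)
        return result;  // not a test file name

    std::string rest = name.substr(prefix.size());
    // Drop an extension such as ".epub" if one is present (assumed).
    const std::size_t dot = rest.find('.');
    if (dot != std::string::npos)
        rest.erase(dot);

    // Split the remainder on '-' into the three identifiers.
    std::vector<std::string> tokens;
    std::istringstream stream(rest);
    std::string token;
    while (std::getline(stream, token, '-'))
        tokens.push_back(token);
    if (tokens.size() != 3)
        return result;

    result.container = EncodingFromToken(tokens[0]);
    result.opf = EncodingFromToken(tokens[1]);
    result.nav = EncodingFromToken(tokens[2]);
    result.valid = !result.container.empty() && !result.opf.empty() &&
                   !result.nav.empty();
    return result;
}
```

For example, ParseTestFileName("test-fileencoding-8-8b-16.epub") describes a file with a UTF-8 container.xml, a UTF-8 with BOM .opf file, and a UTF-16 navigation document.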

readium / readium-sdk

Core is not handling file encodings other than UTF-8 correctly #73
