Detect ISO-8859-1 encoding in files and re-encode

GoogleCodeExporter commented 8 years ago

Although epub is required to contain only UTF-8 or UTF-16, of course it's
possible to sneak in ISO-8859-1.  Some byte sequences in that encoding aren't
valid UTF-8, and in the live/staging environment they are blindly added to the
database and then get truncated at the first invalid character.  The user
isn't notified.

This is especially bad when it happens in the NCX or OPF file, as they
become invalid XML once they go into the database, but aren't invalid when
they come out of the ePub archive, so the initial sanity checks on upload pass.

The best outcome is probably for Bookworm to always manage to convert the
file properly before saving, although I'm not sure yet how best to do that
as this particular truncation problem doesn't happen in my local
environment (instead I get a DjangoUnicodeEncode exception immediately on
upload).
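
One possible approach, as a rough sketch only: try to decode each file as UTF-8 and fall back to ISO-8859-1 when that fails, then re-encode as UTF-8 before saving. The helper name `to_utf8` is hypothetical and is not part of Bookworm; the fallback assumes that anything which is not valid UTF-8 (or BOM-marked UTF-16) is ISO-8859-1, which will not hold for every file.

```python
import codecs


def to_utf8(raw: bytes) -> bytes:
    """Return `raw` re-encoded as UTF-8 (hypothetical helper, not Bookworm code)."""
    # UTF-16 is allowed in an epub; recognize it by its byte-order mark.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16").encode("utf-8")
    try:
        # Already valid UTF-8: pass the bytes through unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is ISO-8859-1.
        # Every byte value decodes under ISO-8859-1, so this never raises,
        # but it would silently mis-decode other 8-bit encodings.
        return raw.decode("iso-8859-1").encode("utf-8")
```

If the NCX or OPF file carries an XML declaration naming ISO-8859-1, that declaration would presumably also need to be updated after re-encoding, otherwise a parser that trusts it would misread the converted bytes.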

Original issue reported on code.google.com by liza31337@gmail.com on 14 May 2009 at 11:15

GoogleCodeExporter commented 8 years ago

Filed related problem with epubcheck as it is passing such epubs:
http://code.google.com/p/epubcheck/issues/detail?id=34

Original comment by liza31337@gmail.com on 14 May 2009 at 11:24

GoogleCodeExporter commented 8 years ago

Original comment by liza31337@gmail.com on 19 May 2009 at 2:57

Added labels: Type-Enhancement

GoogleCodeExporter commented 8 years ago

These books are invalid.

Original comment by liza31337@gmail.com on 13 Nov 2009 at 5:33

Changed state: WontFix

Detect ISO-8859-1 encoding in files and re-encode #145