purnimagupta / threepress

Automatically exported from code.google.com/p/threepress

Handle encoding errors with friendly error message #61

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Someone's been uploading epubs containing files that aren't in UTF-8.  It
would be nice to explicitly handle that error and explain it.

Original issue reported on code.google.com by liza31337@gmail.com on 10 Sep 2008 at 2:09

GoogleCodeExporter commented 8 years ago
The spec says that content must be in UTF-8 or UTF-16.

lxml can handle UTF-16, so we should be able to as well:

>>> from lxml import etree
>>> test = u'<foo>bar</foo>'
>>> etree.XML(test)
<Element foo at 1678300>
>>> etree.XML(test.encode('utf-8'))
<Element foo at 1678210>
>>> etree.XML(test.encode('utf-16'))
<Element foo at 1678360>
>>> etree.XML(test.encode('utf-32'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: utf-32

1. Create test case where content is in UTF-16 and verify that this works
2. Add error messaging to handle other UnicodeDecodeError exceptions, which likely indicate some broken (i.e. Windows) encoding
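
A rough sketch of what step 2 could look like, assuming Python 3 and checking the raw bytes with the standard codecs rather than going through lxml. The function name `check_epub_encoding` and the wording of the message are hypothetical, not part of the existing code:

```python
import codecs


def check_epub_encoding(data, filename="(unknown)"):
    """Return None if data decodes as UTF-8 or UTF-16, else a friendly message.

    Hypothetical helper: the name and message text are illustrative only.
    """
    # A UTF-16 byte order mark means the file should decode as UTF-16;
    # anything else is assumed to be UTF-8.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    else:
        encoding = "utf-8"
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return (
            "The file %s does not appear to be valid %s (problem near byte %d). "
            "EPUB content must be encoded as UTF-8 or UTF-16; please re-save "
            "the file in one of those encodings and upload it again."
            % (filename, encoding.upper(), err.start)
        )
    return None
```

For example, `check_epub_encoding(u'<foo>caf\xe9</foo>'.encode('cp1252'), 'ch1.html')` returns a message instead of raising, while the UTF-8 and UTF-16 encodings of the same string return None.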

Original comment by liza31337@gmail.com on 11 Sep 2008 at 3:14

GoogleCodeExporter commented 8 years ago
(assuming of course that the encoding declaration is right)
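
One way to sanity-check that assumption, sketched with the standard library only. The helper name `declaration_matches` is hypothetical. It only finds declarations in ASCII-compatible files (a UTF-16 file will not match the byte pattern), and a successful decode does not prove the declaration is right, since some encodings accept any byte sequence:

```python
import re

# Matches an encoding pseudo-attribute in an XML declaration at the very
# start of the file. Only works when the file is ASCII-compatible.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declaration_matches(data):
    """Return True if data decodes with the encoding its XML declaration names.

    Returns False if decoding fails or the encoding name is unknown to
    Python, and None when no encoding declaration is found.
    """
    match = _DECL.match(data)
    if match is None:
        return None
    declared = match.group(1).decode("ascii")
    try:
        data.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # LookupError: Python does not know the declared encoding name.
        return False
    return True
```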

Original comment by liza31337@gmail.com on 11 Sep 2008 at 3:17