kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

UnicodeDecodeError when parsing a (supposedly) UTF-7 encoded page #14

Closed VinDuv closed 6 years ago

VinDuv commented 6 years ago

I have a Python module that uses html5-parser to extract information from web pages. It had a problem with http://ynformatics.com/: the page declares its encoding as UTF-7 (both in the HTTP Content-Type header and the <meta charset> tag) but is actually ASCII. Decoding it as UTF-7 fails because of the stray “+” characters in the page (they should be encoded as “+-”).
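The “+-” escaping can be checked with the standard library codec alone. This is a small illustration assuming Python 3 (the traceback in this report comes from Python 2.7); it does not involve html5-parser at all.

```python
# In UTF-7, "+" opens a base64 run, so a literal plus sign is written as "+-".
encoded = "a+b".encode("utf-7")
print(encoded)                  # b'a+-b'
print(encoded.decode("utf-7"))  # a+b

# A bare "+" followed by base64-looking characters is read as the start of
# a base64 run that never terminates cleanly, so strict decoding fails.
try:
    b"+xxx".decode("utf-7")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```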

Instead of using the fallback encoding in that case, html5-parser raises a UnicodeDecodeError. I managed to reproduce the problem with a smaller test case:

>>> html5_parser.parse("+xxx", transport_encoding="utf-7", fallback_encoding='iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "html5_parser/__init__.py", line 185, in parse
    data = as_utf8(html or b'', transport_encoding, fallback_encoding)
  File "html5_parser/__init__.py", line 83, in as_utf8
    data = bytes_or_unicode.decode(transport_encoding).encode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-3: unterminated shift sequence

kovidgoyal commented 6 years ago

That's expected behavior. fallback_encoding is used only if no encoding could be detected, which means no <meta> tag and no transport_encoding. https://html5-parser.readthedocs.io/en/latest/#api-documentation

If needed, you can decode the page to unicode yourself before passing it to parse(): catch the UnicodeDecodeError, decode manually, and call parse() again.
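A minimal sketch of that workaround, assuming Python 3. The helper name decode_with_fallback and its default fallback of iso-8859-1 are hypothetical and not part of html5-parser. It does the decoding up front and returns text, which could then be handed to parse() as described above.

```python
def decode_with_fallback(raw, declared_encoding, fallback_encoding="iso-8859-1"):
    """Decode raw bytes with the declared encoding, else with the fallback.

    Hypothetical helper for illustration; not an html5-parser API.
    """
    if isinstance(raw, str):
        return raw  # already text, nothing to decode
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in the declared encoding.
        # LookupError: the declared encoding name is unknown to Python.
        return raw.decode(fallback_encoding)


# The reduced test case from this report: declared UTF-7, actually ASCII.
text = decode_with_fallback(b"+xxx", "utf-7")
print(text)  # +xxx

# The resulting text would then be passed to html5_parser.parse(text).
```

iso-8859-1 maps every byte to a character, so with that fallback the second decode cannot itself fail. A stricter fallback such as utf-8 would need its own error handling.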

wumpus commented 6 years ago

This is normal for web tools and web archiving tools, alas. I have a hairball that does this fallback properly in https://github.com/cocrawler/cocrawler which I should split into a separate module so it can be used by folks like @VinDuv.