Closed VinDuv closed 6 years ago
That's expected behavior. fallback_encoding is used only if no encoding could be detected, which means no <meta>
tag and no transport_encoding. https://html5-parser.readthedocs.io/en/latest/#api-documentation
You can decode the page to Unicode manually before passing it to parse() if needed: catch the UnicodeDecodeError, decode the bytes yourself, and call parse() again.
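A minimal sketch of that manual decoding step, using only the standard library. The helper name decode_with_fallback, the "utf-8" default, and the errors="replace" policy are assumptions made for illustration; they are not part of html5-parser. Note that this sketch decodes up front rather than catching the exception raised by parse() itself.

```python
def decode_with_fallback(raw, declared_encoding, fallback_encoding="utf-8"):
    """Decode raw bytes with the declared encoding, retrying with a fallback.

    Hypothetical helper: the name, the default fallback, and the
    errors="replace" policy are illustrative choices, not library behavior.
    """
    try:
        return raw.decode(declared_encoding)
    except UnicodeDecodeError:
        # The declared encoding was wrong for these bytes; decode again and
        # substitute U+FFFD for anything the fallback cannot handle.
        return raw.decode(fallback_encoding, errors="replace")


# A page that claims UTF-7 but is really ASCII with stray "+" characters.
page = b"<p>C++ and 1 + 1 = 2</p>"
text = decode_with_fallback(page, "utf-7")
```

The resulting str can then be handed to parse() in place of the raw bytes, so the wrong encoding declaration is never used for decoding.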
This is normal for web tools and web archiving tools, alas. I have a hairball that does this fallback properly in https://github.com/cocrawler/cocrawler, which I should split into a separate module so it can be used by folks like @VinDuv.
I have a Python module that uses html5-parser to extract information from web pages. It had a problem with this page: http://ynformatics.com/. The page declares its encoding as UTF-7 (both in the HTTP Content-Type header and in the <meta charset> tag) but is actually ASCII. Decoding it as UTF-7 fails because of the stray “+” characters in the page (in UTF-7 a literal plus sign has to be encoded as “+-”).
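The UTF-7 failure can be shown with the standard library alone. This is a small illustration, not the test case from this report; the helper name decodes_as_utf7 and the sample bytes are made up for the example.

```python
def decodes_as_utf7(raw):
    """Return True if raw is valid UTF-7 (hypothetical helper for illustration)."""
    try:
        raw.decode("utf-7")
    except UnicodeDecodeError:
        return False
    return True


# A bare "+" opens a base64 shift sequence, so plain ASCII text that
# contains one is rejected by the UTF-7 codec.
stray = decodes_as_utf7(b"C++ and 1 + 1")

# "+-" is the UTF-7 spelling of a literal plus sign.
escaped = b"1 +- 1".decode("utf-7")
```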
Instead of using the fallback encoding in that case, html5-parser raises a UnicodeDecodeError. I managed to reproduce the problem with a smaller test case: