openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Add options to ignore charsets found automatically #318

Closed benoit74 closed 1 week ago

benoit74 commented 1 week ago

In order to cope with websites that declare an incorrect charset, it would be useful to have two options:

While quite "rough" (this changes the scraper behavior for the whole website), this will allow websites like http://www.bouquineux.com to still be scraped properly. This website seems to systematically declare that its HTML content is encoded with iso-8859-1, yet decoding with that charset garbles every accented character, and it really looks like the content has actually been encoded with UTF-8. See e.g. http://www.bouquineux.com/?ebooks=164&Carco for improperly encoded characters. Without these new settings, content is improperly decoded, and since accented characters are present even in URLs (where they should be URL-encoded but ...), the ZIM is simply not usable at all.
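To illustrate the failure mode, here is a minimal Python sketch of what happens when UTF-8 bytes are decoded with a wrongly declared iso-8859-1 charset, and how a switch to ignore the declared charset would help. The `decode_body` helper and its `ignore_declared` and `fallback` parameters are hypothetical names used only for this illustration; they are not warc2zim's actual code or options.

```python
from typing import Optional


def decode_body(
    raw: bytes,
    declared_charset: Optional[str],
    ignore_declared: bool = False,  # hypothetical switch, for illustration only
    fallback: str = "utf-8",  # assumed fallback encoding
) -> str:
    """Decode raw bytes, optionally ignoring the charset the site declared."""
    if declared_charset and not ignore_declared:
        try:
            return raw.decode(declared_charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown or failing charset: fall through to the fallback.
            pass
    return raw.decode(fallback, errors="replace")


# The server sends UTF-8 bytes but claims they are iso-8859-1.
raw = "é".encode("utf-8")

# iso-8859-1 maps every byte to a character, so decoding never raises;
# it silently produces mojibake instead.
trusting = decode_body(raw, "iso-8859-1")

# Ignoring the declared charset recovers the intended character.
ignoring = decode_body(raw, "iso-8859-1", ignore_declared=True)

print(repr(trusting), repr(ignoring))
```

Because iso-8859-1 decoding cannot fail, there is no error to catch and retry on, which is why an explicit option to distrust the declared charset is needed.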