In order to cope with lying websites, it would be useful to have two options:
one to ignore the charset declared in the content's first bytes (typically in HTML documents)
one to ignore the charset declared in the HTTP Content-Type header
While quite "rough" (this changes the scraper behavior for the whole website), it would allow websites like http://www.bouquineux.com to still be scraped properly. This website seems to systematically declare that its HTML content is encoded in iso-8859-1, yet that charset fails to decode any accented characters, and the content really looks like it was encoded in UTF-8. See e.g. http://www.bouquineux.com/?ebooks=164&Carco for improperly encoded characters. Without these new settings, content is improperly decoded, and since accented characters are present even in URLs (where they should be URL-encoded, but ...), the ZIM is simply not usable at all.
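To illustrate the failure mode, here is a minimal Python sketch (the sample string is hypothetical, not taken from the site) of what happens when UTF-8 bytes are decoded with the declared iso-8859-1 charset:

```python
# Hypothetical sample content, encoded as UTF-8 (as the site actually does).
data = "été".encode("utf-8")  # b'\xc3\xa9t\xc3\xa9'

# Decoding with the charset the site declares (iso-8859-1) never raises an
# error, because every byte is valid in that encoding -- so the mistake goes
# unnoticed, but each accented character becomes two garbled ones.
wrong = data.decode("iso-8859-1")  # 'Ã©tÃ©'

# Decoding with the real encoding recovers the text, which is what an
# "ignore declared charset" option would make possible.
right = data.decode("utf-8")  # 'été'

print(wrong, right)
```

Note that iso-8859-1 silently "succeeds" on any byte sequence, which is why the mojibake ends up baked into the ZIM instead of triggering a visible error.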