openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Automated encoding detection is still not working properly #312

Closed: benoit74 closed this issue 2 weeks ago

benoit74 commented 2 weeks ago

Before https://github.com/openzim/warc2zim/pull/302, all JS / JSON (and CSS) documents were assumed to be encoded in UTF-8.

This was unreliable because some sites do not use UTF-8 to encode these documents.

The PR hence modified the code to use the "automated detection" already in place for HTML documents.

With this automated detection algorithm, we try, in order:

- the charset from the HTTP Content-Type header,
- the charset declared in the first bytes of the document,
- the most probable encoding reported by chardet.

Unfortunately, it looks like this automatic detection is not that reliable. This is especially visible for JS because, when no encoding is received in the HTTP headers, there is usually no encoding declared in the first bytes of the document either, so we are back to simply relying on chardet's most probable encoding being correct.
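
For reference, a minimal sketch of what such a detection chain could look like (the function name and the header-sniffing regex are illustrative, not warc2zim's actual code):

```python
import re

import chardet


def guess_charset(content: bytes, http_charset: str | None) -> str:
    """Hypothetical detection chain: HTTP header, then document, then chardet."""
    # 1. Charset announced in the HTTP Content-Type header, if any.
    if http_charset:
        return http_charset
    # 2. Charset declared in the first bytes of the document
    #    (e.g. <meta charset="..."> for HTML).
    head = content[:1024].decode("ascii", errors="ignore")
    match = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    if match:
        return match.group(1)
    # 3. Last resort: chardet's most probable encoding.
    return chardet.detect(content)["encoding"] or "utf-8"
```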

While it gave good results on the files we tested, chardet also performs very poorly in other situations.

E.g. it fails to properly decode https://www.cloudflare.com/vendor/onetrust/scripttemplates/202308.2.0/otBannerSdk.js (which is plain UTF-8 but is detected as Windows-1252 by chardet after a very long heuristic, about 3 seconds on my Linux server).
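
For the record, this can be reproduced with a few lines of Python (a sketch; chardet's exact guess and timing may vary between versions and machines):

```python
import urllib.request

import chardet

url = (
    "https://www.cloudflare.com/vendor/onetrust/scripttemplates/"
    "202308.2.0/otBannerSdk.js"
)
data = urllib.request.urlopen(url).read()

# Reports a non-UTF-8 encoding (Windows-1252 in my test) after a long heuristic.
print(chardet.detect(data))

# Sanity check: the bytes actually decode cleanly as UTF-8.
data.decode("utf-8")
```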

Given all these problems, it is now clear that we first need to assemble a test set of files which we know from experience are difficult to decode, and to gather strings from those files for which we know what the correct decoding is.

Then, based on this test set, we will be able to decide whether an automated approach still seems feasible (and which one), or whether it is just impossible and the only reasonable compromise is to allow specifying at the CLI which encoding to use when it is unknown (with potentially multiple encodings needed per conversion, so potentially needing pattern matching rules ...).

Note: my hopes for an automated solution are decreasing; while researching the web a bit, I discovered that even "big" libraries like httpx are struggling with this matter. It looks like they started with chardet, then switched to a fully manual heuristic (https://github.com/encode/httpx/pull/1269) and are now using charset-normalizer (https://github.com/encode/httpx/pull/1791). And "spoiler": charset-normalizer does not properly decode the content at https://www.marxists.org/espanol/menu.js (which is one of our test cases).
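
A quick sketch of how that test case can be checked with charset-normalizer (the calls below are the library's public API; the guess it returns may vary between versions):

```python
import urllib.request

from charset_normalizer import from_bytes

data = urllib.request.urlopen("https://www.marxists.org/espanol/menu.js").read()

best = from_bytes(data).best()  # most probable match, or None
if best is not None:
    print(best.encoding)    # guessed charset
    print(str(best)[:200])  # decoded preview, to eyeball the accented characters
```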

We should also keep in mind that bad characters exist "for real" in some documents on the web (see https://github.com/openzim/warc2zim/issues/221, where we have a document from https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/ which is almost entirely valid UTF-8, accents work as expected, etc., but contains one bad character which is impossible to decode as UTF-8), and this makes the decoding task even harder.
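
As an illustration of that failure mode (made-up bytes, not the actual content of that page): a document that is valid UTF-8 except for one stray Windows-1252 byte cannot be decoded strictly, but a lenient decode keeps all the valid text:

```python
# Valid UTF-8 text followed by 0x92 (a Windows-1252 curly apostrophe pasted as-is).
data = "Comprendre les cookies : paramétrage".encode("utf-8") + b"\x92suite"

try:
    data.decode("utf-8")  # strict decoding fails on the stray byte
except UnicodeDecodeError as exc:
    print(exc)

# Replacement keeps the valid UTF-8 text and turns the bad byte into U+FFFD.
print(data.decode("utf-8", errors="replace"))
```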

benoit74 commented 2 weeks ago

I've assembled some test files in https://github.com/openzim/warc2zim/pull/314

Luckily, it was quite fast to find enough files to draw some conclusions.

So long story short: as soon as we do not have charset information in the HTTP Content-Type header or in the first bytes of the (HTML) document, or as soon as there is a bad character that is impossible to decode with the charset indicated there, we currently consider that we should use automatic charset detection. This does not work well with the chardet and charset_normalizer libraries (which look like the most popular Python libraries for this job).

My proposition is to simplify the current logic even further and stop guessing; it is just too easy to get fooled and decode only bullshit (which is then re-encoded to UTF-8 and becomes valid bullshit), or to fail to decode with a wrongly guessed charset.

I also consider we should give more confidence to the encoding found in the document's first 1024 bytes (even if it is slower to process) than to the one found in the HTTP Content-Type header. The HTTP header is usually populated automatically by the web server, which in many cases has no idea about the real file charset, while the charset indicated in the document was probably set by the tool/person who generated the file and is hence more likely to be correct.
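
To make the comparison concrete, here is a rough idea of what "the encoding found in the document's first 1024 bytes" could mean in code (a hypothetical helper, not warc2zim's implementation; it only covers the HTML `<meta>` and CSS `@charset` declarations and ignores BOMs):

```python
import re

_HTML_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)
_CSS_CHARSET = re.compile(rb'@charset\s+["\']([\w-]+)["\']')


def declared_charset(content: bytes) -> str | None:
    """Return the charset declared in the first 1024 bytes, if any."""
    head = content[:1024]
    for pattern in (_HTML_CHARSET, _CSS_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return None
```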

Logic would be simplified as such:
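
Purely as a sketch of what I have in mind, assuming the fallback when nothing is declared (or when the declared charset fails) would be a plain UTF-8 decode with replacement of undecodable bytes, and no guessing at all:

```python
def decode_payload(content: bytes, http_charset: str | None) -> str:
    # Trust the charset declared in the document first, then the HTTP header
    # (see the declared_charset() helper sketched above).
    charset = declared_charset(content) or http_charset
    if charset:
        try:
            return content.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declared charset: fall through, no guessing
    # Last resort: UTF-8 with replacement of undecodable bytes.
    return content.decode("utf-8", errors="replace")
```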

This means that we could have a higher chance of putting bad content in the ZIM when the source website is badly configured, since we trust the charset declared in the document header or in the HTTP header; but then it is clearly not our fault, and guessing proved to do more harm than good.

I also considered an alternative where we would not decode at all (or would decode only fragments of the file), since the structure of the document is normally ASCII-only and special characters are found in strings only. While this could work and help from a technical perspective, it means we would break the ZIM specification, which says that all content must be stored encoded in UTF-8 in the ZIM (so we must decode all strings, period). And the technical implementation would not be simple / straightforward.