Closed · benoit74 closed this 2 weeks ago
I've assembled some test files in https://github.com/openzim/warc2zim/pull/314
Luckily, it was quite fast to find enough files to draw some conclusions:
- `file01.js` is most probably encoded using ISO-8859-1, but it is improperly detected as `mac_latin2` by `charset_normalizer` (`chardet` correctly assumes it is most probably ISO-8859-1)
- `file02.js` is most probably encoded using UTF-8, but it is improperly detected as most probably Windows-1252 by `chardet` (`charset_normalizer` correctly assumes it is most probably UTF-8)
- `file03.html` is a sample where we have a bad character due to improper truncation of a grapheme; `chardet` correctly assumes it is UTF-8 (but decoding still fails due to the one bad character); `charset_normalizer` wrongly assumes it is most probably `iso8859_10`, and the file does get decoded with this charset, but all accented characters come out wrong
- `file04.js` and `file05.js` are most probably encoded using ASCII (at least there is nothing outside standard ASCII characters), but `chardet` detection fails on them (no encoding detected at all), while `charset_normalizer` correctly suggests ASCII
- `file06.html` is encoded with gb2312, but `chardet` assumes it is probably ISO-8859-1; `charset_normalizer` proposes gb2312 because it is declared in the HTML `<head>`
- `file07.html` is the same file as `file06.html`, but without the charset indication in the `<head>` section. It is still encoded with gb2312. `chardet` wrongly assumes it is probably ISO-8859-1; `charset_normalizer` wrongly assumes it is probably `mac_iceland`
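The ambiguity these detectors face is easy to reproduce with the standard library alone: many single-byte charsets will happily "succeed" on the same bytes while producing different text, and strict UTF-8 is the only mode that fails loudly. A small sketch of mine, using only Python's built-in codecs:

```python
# b'caf\xe9' is "café" in ISO-8859-1; a lone 0xE9 byte is invalid UTF-8.
data = "café".encode("iso-8859-1")

# Strict UTF-8 refuses the byte; 'replace' mode keeps going.
print(data.decode("utf-8", errors="replace"))  # caf�

# Several legacy single-byte codecs all decode without error; some agree
# on the result, some differ, but none of them raises. This is why
# confidence-based guessing gets fooled so easily: "decodes fine" is
# true for nearly every legacy charset at once.
for enc in ("iso-8859-1", "cp1252", "mac_latin2", "iso8859_10"):
    print(enc, "->", data.decode(enc))
```

Since almost every legacy charset accepts almost every byte sequence, a detector can only rank candidates statistically, and on short or code-heavy files (like JS) the statistics have very little to work with.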
So, long story short: as soon as we do not have charset information in the HTTP `Content-Type` header or in the first bytes of the (HTML) document, or as soon as there is a bad character impossible to decode with the charset indicated there, we currently consider that we should fall back to automatic charset detection. This does not work reliably with the `chardet` and `charset_normalizer` libraries (which look like the most popular Python libraries for this job).
My proposition is to simplify the current logic even further and stop guessing: it is just too easy to get fooled and decode only bullshit (which is then re-encoded to UTF-8 and becomes valid bullshit), or to fail to decode with a wrongly guessed charset.
I also consider we should give more confidence to the encoding found in the document's first 1024 bytes (even if it is slower to process) than to the one found in the HTTP `Content-Type` header. The HTTP header is usually populated automatically by the webserver, which in many cases has no idea about the real file charset, while the charset indicated in the document was probably set by the tool/person who generated the file and is hence more likely to be true.
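As an illustration of the HTTP-level source, the `charset` parameter of a `Content-Type` header value can be parsed with the Python standard library (just a sketch, not warc2zim code):

```python
from email.message import Message

# Parse the charset parameter out of a Content-Type header value.
# Webservers often emit this blindly, with no knowledge of the actual
# file encoding, which is why it is the less trustworthy source.
msg = Message()
msg["Content-Type"] = 'text/html; charset="ISO-8859-1"'
print(msg.get_content_charset())  # normalized to lowercase: iso-8859-1
```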
Logic would be simplified as such:

- use the charset declared in the first 1024 bytes of the document, if any
- otherwise, use the charset from the HTTP `Content-Type` header; if found, decode with this encoding in 'replace' mode (always going to work; we do not care if the specified encoding was wrong, not our fault)
- otherwise, try a list of charsets, configurable via `--guessing-charsets`, so that it is possible to tweak this list when needed

This means that we could have higher chances of putting bad content in the ZIM when the source website is badly configured, since we trust the charset declared in the document header or HTTP header; but then it is clearly not our fault, and guessing has proved to do more harm than good.
I also considered an alternative where we would not decode at all (or decode only fragments of the file), since the structure of the document is normally ASCII-only and special characters are found in strings only. While this could work and help from a technical perspective, it would break the ZIM specification, which says that all content must be stored encoded in UTF-8 in the ZIM (so we must decode all strings, period). And the technical implementation would not be simple/straightforward.
Before https://github.com/openzim/warc2zim/pull/302, all JS / JSON (and CSS) documents were supposed to be encoded in UTF-8.
This was unreliable because some sites do not use UTF-8 for encoding these documents.
The PR hence modified the code to use the "automated detection" already in place for HTML documents.
With this automated detection algorithm, we try in order:

- the encoding received in the HTTP header
- the encoding specified in the first bytes of the document
- `chardet`'s most probable encoding

Unfortunately, it looks like this automatic detection is not that reliable. This is especially visible for JS because, when no encoding is received in the HTTP header, there is no encoding specified in the first bytes of the document either, so we are back to simply relying on `chardet`'s most probable encoding being correct. While it gave good results on the files we tested, it seems that `chardet` also performs very poorly in other situations. E.g. it fails to properly decode https://www.cloudflare.com/vendor/onetrust/scripttemplates/202308.2.0/otBannerSdk.js (which is just UTF-8, but detected as Windows-1252 by `chardet` after a very long heuristic run, about 3 seconds on my Linux server).
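Since strict UTF-8 decoding either succeeds or fails with no heuristic involved, a cheap pre-check (my own sketch, not something warc2zim currently does) would avoid paying chardet's cost on files that are plain UTF-8 anyway:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A pure-UTF-8 file like otBannerSdk.js would pass this check instantly,
# with no need for slow statistical detection; only files that fail it
# would need any further decision.
```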
Given all these problems, it is now clear that we first need to assemble a test set of files that are known (from our experience) to be difficult to decode, and gather strings from those files for which we know what the correct decoding should be.
Then, based on this test set, we will be able to decide whether an automated approach still seems feasible (and which one), or whether it is just impossible and the most reasonable compromise is to allow specifying at the CLI the encoding to use when it is unknown (with potentially multiple encodings needed per conversion, so potentially needing pattern matching rules...).
Nota: my hopes for an automated solution are decreasing; while researching the web a bit, I discovered that even "big" libraries like httpx are struggling with this matter. It looks like they started with `chardet`, then switched to a fully manual heuristic (https://github.com/encode/httpx/pull/1269), and are now using `charset-normalizer` (https://github.com/encode/httpx/pull/1791). And, "spoiler": `charset-normalizer` does not properly decode the content at https://www.marxists.org/espanol/menu.js (which is one of our test cases).

We should also keep in mind that bad characters exist "for real" in some documents on the web (see https://github.com/openzim/warc2zim/issues/221, where we have a document from https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/ which is mostly valid UTF-8, with accents working as expected, but contains one bad character that is impossible to decode as UTF-8), which makes this decoding task even harder.