openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Automated encoding detection is still not working properly #312

Closed: benoit74 closed this issue 2 weeks ago

benoit74 commented 2 weeks ago

Before https://github.com/openzim/warc2zim/pull/302, all JS / JSON (and CSS) documents were assumed to be encoded in UTF-8.

This was unreliable because some sites do not use UTF-8 to encode these documents.

The PR hence modified the code to use the "automated detection" already in place for HTML documents.

With this automated detection algorithm, we try, in order:

- the charset from the HTTP Content-Type header,
- the charset declared in the first bytes of the document,
- the most probable encoding reported by chardet.

Unfortunately, it looks like this automatic detection is not that reliable. This is especially visible for JS because, when no encoding is received in the HTTP headers, there is usually no encoding declared in the first bytes of the document either, so we are back to simply relying on chardet's most probable encoding being correct.
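
For reference, a minimal sketch of what such a detection chain could look like (the function name and the header-sniffing regex are illustrative, not warc2zim's actual code):

```python
import re

import chardet


def guess_charset(content: bytes, http_charset: str | None) -> str:
    """Hypothetical detection chain: HTTP header, then document, then chardet."""
    # 1. Charset announced in the HTTP Content-Type header, if any.
    if http_charset:
        return http_charset
    # 2. Charset declared in the first bytes of the document
    #    (e.g. <meta charset="..."> for HTML).
    head = content[:1024].decode("ascii", errors="ignore")
    match = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    if match:
        return match.group(1)
    # 3. Last resort: chardet's most probable encoding.
    return chardet.detect(content)["encoding"] or "utf-8"
```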

While it gave good results on the files we tested, chardet also performs very poorly in other situations.

E.g. it fails to properly decode https://www.cloudflare.com/vendor/onetrust/scripttemplates/202308.2.0/otBannerSdk.js (which is plain UTF-8 but is detected as Windows-1252 by chardet after a very long heuristic, about 3 seconds on my Linux server).
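
For the record, this can be reproduced with a few lines of Python (a sketch; chardet's exact guess and timing may vary between versions and machines):

```python
import urllib.request

import chardet

url = (
    "https://www.cloudflare.com/vendor/onetrust/scripttemplates/"
    "202308.2.0/otBannerSdk.js"
)
data = urllib.request.urlopen(url).read()

# Reports a non-UTF-8 encoding (Windows-1252 in my test) after a long heuristic.
print(chardet.detect(data))

# Sanity check: the bytes actually decode cleanly as UTF-8.
data.decode("utf-8")
```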

Given all these problems, it is now clear that we first need to assemble a test set of files which we know from experience are difficult to decode, and to gather strings from those files for which we know what the correct decoding is.

Then, based on this test set, we will be able to decide whether an automated approach still seems feasible (and which one), or whether it is just impossible and the only reasonable compromise is to allow specifying at the CLI which encoding to use when it is unknown (with potentially multiple encodings needed per conversion, so potentially needing pattern matching rules ...).

Note: my hopes for an automated solution are decreasing; while researching the web a bit, I discovered that even "big" libraries like httpx are struggling with this matter. It looks like they started with chardet, then switched to a fully manual heuristic (https://github.com/encode/httpx/pull/1269) and are now using charset-normalizer (https://github.com/encode/httpx/pull/1791). And "spoiler": charset-normalizer does not properly decode the content at https://www.marxists.org/espanol/menu.js (which is one of our test cases).
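
A quick sketch of how that test case can be checked with charset-normalizer (the calls below are the library's public API; the guess it returns may vary between versions):

```python
import urllib.request

from charset_normalizer import from_bytes

data = urllib.request.urlopen("https://www.marxists.org/espanol/menu.js").read()

best = from_bytes(data).best()  # most probable match, or None
if best is not None:
    print(best.encoding)    # guessed charset
    print(str(best)[:200])  # decoded preview, to eyeball the accented characters
```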

We should also keep in mind that bad characters exist "for real" in some documents on the web (see https://github.com/openzim/warc2zim/issues/221, where we have a document from https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/ which is almost entirely valid UTF-8, accents work as expected, etc., but contains one bad character which is impossible to decode as UTF-8), and this makes the decoding task even harder.
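
As an illustration of that failure mode (made-up bytes, not the actual content of that page): a document that is valid UTF-8 except for one stray Windows-1252 byte cannot be decoded strictly, but a lenient decode keeps all the valid text:

```python
# Valid UTF-8 text followed by 0x92 (a Windows-1252 curly apostrophe pasted as-is).
data = "Comprendre les cookies : paramétrage".encode("utf-8") + b"\x92suite"

try:
    data.decode("utf-8")  # strict decoding fails on the stray byte
except UnicodeDecodeError as exc:
    print(exc)

# Replacement keeps the valid UTF-8 text and turns the bad byte into U+FFFD.
print(data.decode("utf-8", errors="replace"))
```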

benoit74 commented 2 weeks ago

I've assembled some test files in https://github.com/openzim/warc2zim/pull/314

Luckily, it was quite fast to find enough files to draw some conclusions.

So long story short: as soon as we do not have charset information in the HTTP Content-Type header or in the first bytes of the (HTML) document, or as soon as there is a bad character that is impossible to decode with the charset indicated there, we currently consider that we should use automatic charset detection. This does not work well with the chardet and charset_normalizer libraries (which look like the most popular Python libraries for this job).

My proposition is to simplify the current logic even further and stop guessing; it is just too easy to get fooled and decode only bullshit (which is then re-encoded to UTF-8 and becomes valid bullshit), or to fail to decode with a wrongly guessed charset.

I also consider we should give more confidence to the encoding found in the document's first 1024 bytes (even if it is slower to process) than to the one found in the HTTP Content-Type header. The HTTP header is usually populated automatically by the web server, which in many cases has no idea about the real file charset, while the charset indicated in the document was probably set by the tool/person who generated the file and is hence more likely to be correct.
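
To make the comparison concrete, here is a rough idea of what "the encoding found in the document's first 1024 bytes" could mean in code (a hypothetical helper, not warc2zim's implementation; it only covers the HTML `<meta>` and CSS `@charset` declarations and ignores BOMs):

```python
import re

_HTML_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)
_CSS_CHARSET = re.compile(rb'@charset\s+["\']([\w-]+)["\']')


def declared_charset(content: bytes) -> str | None:
    """Return the charset declared in the first 1024 bytes, if any."""
    head = content[:1024]
    for pattern in (_HTML_CHARSET, _CSS_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return None
```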

Logic would be simplified as such:
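
Purely as a sketch of what I have in mind, assuming the fallback when nothing is declared (or when the declared charset fails) would be a plain UTF-8 decode with replacement of undecodable bytes, and no guessing at all:

```python
def decode_payload(content: bytes, http_charset: str | None) -> str:
    # Trust the charset declared in the document first, then the HTTP header
    # (see the declared_charset() helper sketched above).
    charset = declared_charset(content) or http_charset
    if charset:
        try:
            return content.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declared charset: fall through, no guessing
    # Last resort: UTF-8 with replacement of undecodable bytes.
    return content.decode("utf-8", errors="replace")
```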

This means that we could have a higher chance of putting bad content in the ZIM when the source website is badly configured, since we trust the charset declared in the document header or in the HTTP header; but then it is clearly not our fault, and guessing proved to do more harm than good.

I also considered an alternative where we would not decode at all (or would decode only fragments of the file), since the structure of the document is normally ASCII-only and special characters are found in strings only. While this could work and help from a technical perspective, it means we would break the ZIM specification, which says that all content must be stored encoded in UTF-8 in the ZIM (so we must decode all strings, period). And the technical implementation would not be simple / straightforward.