openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Charsets declared in HTML documents are not rewritten #253

Closed by benoit74 3 weeks ago

benoit74 commented 1 month ago

While we always re-encode all documents to UTF-8 (as recommended by the ZIM specification), we do not update the charset declared in the HTML document headers accordingly.

When present, both the Content-Type meta and the charset meta should probably be updated to always indicate UTF-8. Most browsers do not care much about these values and have implemented nice fallbacks, so this is not urgent to fix, but it is probably still important in order to produce valid HTML documents inside the ZIM.

E.g.

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-32" />
    <meta charset="UTF-32" />

Should be fixed to

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta charset="UTF-8" />
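For illustration only, the rewrite described above could be sketched roughly as follows. This is not warc2zim's actual implementation: the function name `force_utf8_charset` and the regex-based approach are assumptions made for this example, and a real fix would more likely hook into the existing HTML rewriting step rather than post-process the text with a regular expression.

```python
import re

# Hypothetical helper, not part of warc2zim.
# Matches a charset declaration inside a <meta> tag, covering both forms:
#   <meta charset="UTF-32" />
#   <meta http-equiv="Content-Type" content="text/html; charset=UTF-32" />
# Group 1 is everything up to and including "charset=", group 2 is the
# optional quote that directly surrounds the charset value.
_META_CHARSET = re.compile(
    r"""(<meta\b[^>]*?\bcharset\s*=\s*)(["']?)[^\s"'/>;]+\2""",
    re.IGNORECASE,
)


def force_utf8_charset(html: str) -> str:
    """Return html with every <meta> charset declaration set to UTF-8."""
    return _META_CHARSET.sub(r"\g<1>\g<2>UTF-8\g<2>", html)


if __name__ == "__main__":
    sample = (
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-32" />\n'
        '<meta charset="UTF-32" />'
    )
    print(force_utf8_charset(sample))
```

The sketch only touches declarations that are already present; documents without any charset declaration are left unchanged, which matches the "when present" wording above.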