openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Add option to specify how many bytes to consider when searching for the charset in the content header #320

Closed: benoit74 closed this issue 1 week ago

benoit74 commented 1 week ago

At https://www.marxists.org/espanol/justo/suvida.htm, the charset declared in the HTML header unfortunately sits too far into the document: we need to read 1028 bytes to find it in full, while by default we only look at the first 1024 bytes.

Currently, we arbitrarily consider only the first 1024 bytes of the content when looking for the charset. This default makes sense as a compromise between the ability to find all charsets and performance / memory footprint, but it would help a lot to be able to customize it for rare cases like this one, where the charset is specified in the content, is somewhat unusual (windows-1252 here), and we don't mind exploring more bytes of every content.

I suggest we add an option to customize this "magic number".
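
To make the idea concrete, here is a minimal sketch of what a configurable limit could look like. It is not warc2zim's actual code: the function name `find_charset`, the `header_bytes` parameter, and the simplified regular expression are all assumptions made for illustration. The pattern only accepts a charset value that is followed by a delimiter, so a declaration cut off by the byte limit is ignored rather than returned in truncated form.

```python
import re
from typing import Optional

# Assumed default, mirroring the 1024-byte limit described above.
DEFAULT_HEADER_BYTES = 1024

# Simplified, hypothetical pattern: a charset value followed by a delimiter
# (quote, whitespace, ";", "/" or ">"). The lookahead rejects values that
# were cut off by the byte limit.
CHARSET_RE = re.compile(
    rb'charset\s*=\s*["\']?([\w\-]+)(?=["\'\s;/>])', re.IGNORECASE
)


def find_charset(
    content: bytes, header_bytes: int = DEFAULT_HEADER_BYTES
) -> Optional[str]:
    """Return the charset declared in the first `header_bytes` bytes, if any."""
    head = content[:header_bytes]  # only this prefix is ever inspected
    match = CHARSET_RE.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

With a document whose declaration only completes at byte 1028, `find_charset(content)` finds nothing with the default limit, while `find_charset(content, header_bytes=1028)` returns `windows-1252`. Raising the limit applies to every document processed, which is the performance / memory trade-off mentioned above.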