Add option to specify how many characters to consider when searching for the charset in the content header

At https://www.marxists.org/espanol/justo/suvida.htm, the charset declaration in the HTML header unfortunately sits just past the limit: we need to read 1028 bytes to find it in full, instead of the default 1024 bytes.

Currently, we arbitrarily consider only the first 1024 bytes of the content when looking for the charset. This default makes sense as a compromise between the ability to find every charset declaration and performance / memory footprint. Still, it would help a lot if we could customize this value for rare cases like this one, where the charset is declared, is somewhat unusual (windows-1252 here), and we don't mind scanning more bytes of every document.

I suggest we add an option to customize this "magic number".
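To illustrate the idea, here is a minimal sketch of a charset lookup with a configurable byte limit. It is not the current warc2zim code: the names `find_declared_charset`, `max_bytes` and `DEFAULT_CHARSET_SEARCH_BYTES` are hypothetical, and the regex is a simplification of real HTML charset sniffing. It only shows how a declaration that ends at byte 1028 is missed with a 1024-byte window and found with a larger one.

```python
import re
from typing import Optional

# Hypothetical constant mirroring the 1024-byte default described in this issue.
DEFAULT_CHARSET_SEARCH_BYTES = 1024

# Simplified pattern: a charset value followed by a delimiter, so that a value
# cut off by the byte limit is not reported as a (wrong) shorter charset name.
CHARSET_RE = re.compile(
    rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-:.]+)(?=[\"'\s;/>])",
    re.IGNORECASE,
)


def find_declared_charset(
    content: bytes, max_bytes: int = DEFAULT_CHARSET_SEARCH_BYTES
) -> Optional[str]:
    """Return the charset declared in the first max_bytes of content, if any."""
    head = content[:max_bytes]
    match = CHARSET_RE.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Build a document whose charset declaration ends exactly at byte 1028.
declaration = b'<meta content="text/html; charset=windows-1252"'
padding = b" " * (1028 - len(declaration))
document = padding + declaration + b"></head><body></body></html>"

print(find_declared_charset(document))                  # None
print(find_declared_charset(document, max_bytes=1028))  # windows-1252
```

With something like this, the limit could be exposed as a command-line option or a configuration value, keeping 1024 as the default so existing behaviour does not change.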

openzim / warc2zim

Add option to specify how many characters to consider when searching for the charset in the content header #320