openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

LookupError: unknown encoding: unicode #331

Open rgaudin opened 6 days ago

rgaudin commented 6 days ago

Not sure if this is still valid (we merged related changes last week, I think), but this zimit.kiwix.org run failed during rewriting:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 585, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 507, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 146, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 330, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 748, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/items.py", line 56, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 108, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 247, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 93, in content_str
    return to_string(
           ^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/utils.py", line 175, in to_string
    return input_.decode(head_encoding, errors="replace")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LookupError: unknown encoding: unicode

rgaudin commented 6 days ago

And https://farm.zimit.kiwix.org/pipeline/2c1770f9-1281-48f0-ab8a-4bb2a727552b/debug

benoit74 commented 6 days ago

Thank you!

benoit74 commented 3 days ago

Same problem with iso-utf-8 on https://farm.openzim.org/pipeline/d1fa0c7a-29c2-4229-80cf-686aba6ac0f5/debug

Note that on https://farm.zimit.kiwix.org/pipeline/2c1770f9-1281-48f0-ab8a-4bb2a727552b/debug the problem was with iso-8559-1 (this is a typo in the declared encoding; the intended encoding is most probably iso-8859-1)

benoit74 commented 3 days ago

So the problematic pages are https://www.qsl.net/emporiaars/newsletter.html, https://marxists.incn.su/history/etol/writers/goldman/1936/09/campaign.html and https://www.qsl.net/vk2jem/swlogs.htm