openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Invalid UTF-8 (surrogates) sent to libzim #382

Open rgaudin opened 2 months ago

rgaudin commented 2 months ago

In this zimit run (logs gone) of https://www.meds.cl/, it seems the string added as content is not valid UTF-8 (it contains the surrogate code point U+DBFF in this case), so libzim refuses it.

This Stack Overflow answer suggests the content could be encoded JSON.
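To illustrate how JSON could be the source, here is a minimal sketch showing that Python's `json` module accepts an escaped surrogate and returns a `str` that cannot be encoded as UTF-8. The JSON literal and the helper name `can_encode_utf8` are made up for illustration; whether the meds.cl content actually went through this path has not been verified.

```python
import json

# JSON allows \uXXXX escapes, including a surrogate that has no partner.
# Python's json module decodes it without complaint.
decoded = json.loads('"\\udbff"')


def can_encode_utf8(text: str) -> bool:
    """Return True if text can be encoded as strict UTF-8 (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(hex(ord(decoded)))          # code point of the single decoded character
print(can_encode_utf8(decoded))   # the surrogate makes strict encoding fail
```

If this is what happens, any rewriting step that decodes JSON into a `str` and later writes it back as text would hit the same error.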

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 695, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 616, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 168, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 384, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 965, in add_items_for_warc_record
    raise exc
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 962, in add_items_for_warc_record
    self.creator.add_item(payload_item)
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/creator.py", line 468, in add_item
    raise exc
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/creator.py", line 465, in add_item
    super().add_item(item)
  File "libzim/libzim.pyx", line 358, in libzim._Creator.add_item
RuntimeError: Traceback (most recent call last):
  File "libzim/libzim.pyx", line 121, in libzim.contentprovider_cy_call_fct
  File "libzim/libzim.pyx", line 85, in libzim.call_method
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/items.py", line 130, in get_contentprovider
    return StringProvider(content=content, ref=self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/providers.py", line 36, in __init__
    super().__init__(content)
  File "libzim/libzim.pyx", line 469, in libzim.StringProvider.__init__
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbff' in position 493623: surrogates not allowed

I believe this should be sorted out before passing the content to libzim, but I haven't dug into the issue.
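One possible way to sort it out upstream, sketched under assumptions: replace every surrogate code point with U+FFFD before the string reaches the provider. The function name `replace_surrogates` is hypothetical and is not part of warc2zim or zimscraperlib; replacing (rather than dropping the character or failing the record) is only one of several options.

```python
import re

# Surrogate code points (U+D800 to U+DFFF) are never valid in UTF-8.
_SURROGATES = re.compile(r"[\ud800-\udfff]")


def replace_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace each surrogate code point with the replacement character.

    Hypothetical helper for illustration, not an existing warc2zim function.
    """
    return _SURROGATES.sub(replacement, text)


content = "before \udbff after"
safe = replace_surrogates(content)
payload = safe.encode("utf-8")  # no longer raises UnicodeEncodeError
```

This loses the original code point, so it is probably worth logging the affected entry path when a replacement happens.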

benoit74 commented 2 months ago

To be precise, I believe it is not libzim itself but python-libzim that refuses to encode the string:

Python 3.12.1 (main, Jan  9 2024, 15:41:00) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\uDBFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbff' in position 0: surrogates not allowed
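The `print` call above fails because stdout is UTF-8 here; calling `str.encode("utf-8")` directly raises the same error regardless of the terminal. As a debugging aid, the sketch below uses the `start` and `end` attributes of `UnicodeEncodeError` to pull out the text around the offending position (493623 in the original traceback). The helper name `first_unencodable` and the context width are assumptions for illustration.

```python
def first_unencodable(text: str, context: int = 20):
    """Return (position, surrounding snippet) of the first character that
    cannot be encoded as UTF-8, or None if the text encodes cleanly.

    Hypothetical debugging helper, not part of any library.
    """
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        begin = max(exc.start - context, 0)
        return exc.start, text[begin:exc.end + context]
    return None


sample = "x" * 30 + "\udbff" + "y" * 5
found = first_unencodable(sample)
if found is not None:
    position, snippet = found
    # repr() keeps the surrogate printable as an escape sequence
    print(position, repr(snippet))
```

Printing the snippet with `repr()` avoids triggering the very error being investigated.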

To be analyzed, but it needs to be sorted out one way or the other.

ZIM must contain only UTF-8, so there is no reason to pass it a Unicode surrogate, which is not valid in UTF-8.