Open rgaudin opened 2 months ago
To be precise, I believe it is not the libzim but the python-libzim which refuses to encode:
Python 3.12.1 (main, Jan 9 2024, 15:41:00) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\uDBFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbff' in position 0: surrogates not allowed
To be analyzed, but it needs to be sorted out one way or the other.
ZIM must contain only UTF-8, so there is no reason to pass it a lone Unicode surrogate, which cannot be encoded as valid UTF-8.
In this zimit run (logs gone) of https://www.meds.cl/, it seems the string added as content is not valid UTF-8 (it contains U+DBFF in this case), so libzim refuses it. This SO answer suggests this could be encoded JSON.
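If the encoded-JSON hypothesis holds, a minimal sketch of how such a string could come about looks like this. The payload is made up for illustration and is not taken from the actual crawl. Python's `json` module accepts an unpaired `\udbff` escape and returns a `str` holding a lone surrogate, which only fails later when something tries to encode it as UTF-8.

```python
import json

# Hypothetical payload: a JSON string literal with an unpaired
# high-surrogate escape (no low surrogate follows it).
payload = '"caf\\udbff"'

# json.loads does not reject the lone surrogate; it ends up in the str.
decoded = json.loads(payload)

# The problem only surfaces once the text is encoded as strict UTF-8.
try:
    decoded.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(repr(decoded), encodable)
```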
I believe this should be sorted out before passing it to libzim, but I haven't dug into the issue.
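One possible way to sort it out on the caller's side is sketched below. The helper names `has_lone_surrogate` and `replace_lone_surrogates` are hypothetical and not part of python-libzim or zimit. Replacing with U+FFFD is only one policy; dropping the character or failing the entry outright may be more appropriate.

```python
import re

# Any code point in U+D800..U+DFFF inside a Python str is a surrogate
# that strict UTF-8 encoding will reject.
_SURROGATES = re.compile("[\ud800-\udfff]")


def has_lone_surrogate(text: str) -> bool:
    """Return True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def replace_lone_surrogates(text: str) -> str:
    """Replace each surrogate code point with U+FFFD (assumed policy)."""
    return _SURROGATES.sub("\ufffd", text)


content = "caf\udbff"
if has_lone_surrogate(content):
    content = replace_lone_surrogates(content)

# content can now be encoded as UTF-8 before being handed over.
data = content.encode("utf-8")
```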