openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0
Zimit WARC files of archives.nyphil.org are full of unrecognized characters #253

Open benoit74 opened 12 months ago

benoit74 commented 12 months ago

A youzim.it run of https://archives.nyphil.org/ failed after reporting many characters that could not be decoded.

Task is here.

Command used:

zimit --url=https://archives.nyphil.org/ --name=archives.nyphil.org_67aad441 --zim-file=archives.nyphil.org_67aad441.zim --userAgentSuffix=Youzim.it+ --sizeLimit=4294967296 --timeLimit=7200 --output=/output --statsFilename=/output/task_progress.json --adminEmail=contact+zimfarm@kiwix.org

Final error:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 541, in <module>
    zimit()
  File "/usr/bin/zimit", line 443, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 811, in warc2zim
    return warc2zim.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 433, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 646, in add_items_for_warc_record
    payload_item = WARCPayloadItem(record, self.head_insert, self.css_insert)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 179, in __init__
    self.title = parse_title(self.content)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 714, in parse_title
    soup = BeautifulSoup(content, "html.parser")
  File "/app/zimit/lib/python3.10/site-packages/bs4/__init__.py", line 348, in __init__
    self._feed()
  File "/app/zimit/lib/python3.10/site-packages/bs4/__init__.py", line 434, in _feed
    self.builder.feed(self.markup)
  File "/app/zimit/lib/python3.10/site-packages/bs4/builder/_htmlparser.py", line 377, in feed
    parser.feed(markup)
  File "/usr/lib/python3.10/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/usr/lib/python3.10/html/parser.py", line 178, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib/python3.10/html/parser.py", line 263, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib/python3.10/_markupbase.py", line 144, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/usr/lib/python3.10/_markupbase.py", line 390, in _scan_name
    raise AssertionError(
AssertionError: expected name token at '<![\x05�\x069�y�\x00"���@��\x11H'
FATAL: exception not rethrown

Before that, the following warning appears many times in the log:

[WARNING] Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
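
The repeated decoding warning, together with the `AssertionError` raised on `<![` followed by control bytes, suggests that a binary payload may be reaching the HTML title parser. The sketch below shows one way such a payload could be skipped instead of crashing the run. It is not warc2zim's actual code or fix: `looks_binary`, `safe_parse_title`, the 4096-byte sample and the 10% threshold are all hypothetical, and it uses the standard-library `html.parser` directly rather than BeautifulSoup.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        text = "".join(self._parts).strip()
        return text or None


def looks_binary(payload: bytes, max_bad_ratio: float = 0.1) -> bool:
    """Heuristic (assumption): NUL bytes or many undecodable bytes mean 'not HTML'."""
    sample = payload[:4096]
    if not sample:
        return False
    if b"\x00" in sample:
        return True
    decoded = sample.decode("utf-8", errors="replace")
    # U+FFFD is the REPLACEMENT CHARACTER mentioned in the warning.
    return decoded.count("\ufffd") / len(decoded) > max_bad_ratio


def safe_parse_title(payload: bytes) -> Optional[str]:
    """Return the page title, or None if the payload is binary or unparseable."""
    if looks_binary(payload):
        return None
    parser = _TitleParser()
    try:
        parser.feed(payload.decode("utf-8", errors="replace"))
        parser.close()
    except AssertionError:
        # Same exception type as in the traceback above; treat as "no title".
        return None
    return parser.title
```

With this guard, a record such as `b'<![\x05\x00...'` would yield `None` rather than aborting the whole conversion. Whether silently skipping the title is acceptable, or whether the record's declared content type should be checked first, is a design decision for the maintainers.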

benoit74 commented 5 months ago

We need to rerun the process with Zimit2 to confirm whether the issue is still present.