openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Invalid WARC Record #299

Open rgaudin opened 6 months ago

rgaudin commented 6 months ago

Opening this here for you to triage: run 67615 failed when warc2zim tried to load one of the WARC files.

Processing WARC files in /output/.tmpe5adwulz/collections/crawl-20240517013428468/archive
16 WARC files found
Calling warc2zim with these args: ['--name=www.biblehub.com_9b039781', '--zim-file=www.biblehub.com_9b039781.zim', '--publisher=openZIM', '--output', '/output', '--url', 'https://www.biblehub.com/', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpe5adwulz/collections/crawl-20240517013428468/archive']
[DEBUG] Confirming output is writable using /output/tmpmjeyspt_
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 528764777
    Remainder: b'\x17\xad\xb1\x19\x06\xbb>\x91D\xa9-\x1a\x17\xc8Qu(\xdf\xc2\x82-\x82\x9c!<_\xc2Qp\xd5&\x82L`,\xe8*\xf1\x92.\x9eqvbz\xa6\x91GB\x8c\xf3\xd3\xad\x88\xc5Cu\xcf\xe3\'\x7f\x85e\x1a\x9dTnaW:\xa9\xbe\xae\x19;^,\xca \xdc\xc4\x8e\t\xbcnT\xee\xadk Z$\x1e\x8d\x8c\x82\xd9\x92F0\x1e\xbfC\x8f f\xb9x\x80\xc7+oE.9V\xec1\x89}\x1a\r\xae\xb8j\x98\x93\x81\x8e\xfcW H\xf4S\x9bh\xb3\xd5\xf4a\xc8\xb2zy7\xaf\xc9\x91\xbc\x85\xded\xab\xad|[\xb3D\xa1~\xbc\xac\xe9\x8dP\x95U&])d\x97r\x97xV3#\xa9\x9b\x86\x9a\x98\xc0\xa9\x86\x1e\x19\x0c\x92\xe5\xd6\x92\xc4\xfd\xa1j\xdf\xf5\xf3Otup\xd1\xa4\xd2\xd7{\xb6\xe3\x86\xb87\xfehk\xd2\x19f\xb3s\xde\xd2\x97\xccP\x07\xd6;!\x1a\x94\x8d\x9e\x8f\x1c\x1bk|kVR\xae\xea\xc9R\x10\x99e]^\x02.b%U\x16\xa4\x12#\x82BT.\xc6\xa3\xf7\xae\xe9\xa2O\xbb\xa2Z\x0c\xa9\xad\x18U\xd8K\x8c\xf8\x155\xe3\x17\t;J\x9b\x03h7\x9cr\xaa=L5\x81~\r\x125Q\xb0\xddf\xb9\x8c\xc9[\xbb\xe7\xa9y\x0c\x86\xab\xa4\xdf\xbc\x9c\x1e\t\x04\x91\xba\x97S"\xf3Q\xdb\xc5\x89\x10;\xc9\xbe\x9d\x8a\xc1:t}\x9fr$(\xbfz&\\j)\xc0\xfb\xd6\x13\xcc\xc4\xf4$\xee\xae\'\xcb\x9d\xb0\x94L\x14\x0c[W\xae\x8b\x01`\xb7\xb2\x88(\xb6\x9b\xe9\xa0\xc1\x12h\xb55\x1d\xa1\x04\xa5eDI\\\xe2\xc9\x1dVvM\xbe\x83\xed\xc8\x0f\xcf\x91\xa1\x89\xe2e\x8b H\x8f\xbdJ%\xc6\x86\x00\xd1\x8e\x9d\x9d\\\x1f\xae\\\xebH\x1a\x9b\x97\x946\xf8\x90\xad\xae\xb5\xb6\x9e\xff\x14B\xe5OE4\xd8\xb5\xc4\x88\xeb\xd30\xc9\x93\xb2/\x8e\xfd\x16\xe5m \xdeg[Y@\xbb\xbb\x92\x19s\t\xea4Z\xaaH\xd8%\xd2F35\x15\xd8\x94\xe9S\xa0\xd4\xfb\xf2\x81\x8b\x80\xd7\xf09\xc4$\x84f\x13\xa8\xbc^9\\I\xb2\xcb\xe0\x9b4\x84\t\xe3\x1d\xc4\xc7\x96\x9c\x00\x90\x8a\n'
Traceback (most recent call last):
  File "/app/zimit/lib/python3.10/site-packages/warcio/recordloader.py", line 224, in _detect_type_load_headers
    rec_headers = self.warc_parser.parse(stream, statusline)
  File "/app/zimit/lib/python3.10/site-packages/warcio/statusandheaders.py", line 270, in parse
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: 9Á‚c¢ Ï}¹»7É| @ˆ‚p€“Ž8|Na5B(û@Xä!”}¢ ÃX@N–2âóJ©p3@‰X“EQç¼
øM”%Ú4‹Šd¢$O5 •#*šæњumƒÈÝÈz&šÙ}•%“˜óq¸T–¿u4‘ÙøåÛ   ¹ªF(½ýæA•DƒÆ£ގ<ë.†|5·ç-@›‘V(dX%sˆ¤lgK¢’ê°Ñã)[Ðá~ê4)Iö©ŒAZ–ó¤ÅÚM·UçÎg$ݤIÏE£cWôÕLC»i3+iªÿa¤Ø±úl\¹d*éC2ÊPy•šj<Í5

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 464, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 113, in main
    return converter.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 231, in run
    self.find_main_page_metadata()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 323, in find_main_page_metadata
    for record in self.iter_all_warc_records():
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 320, in iter_all_warc_records
    yield from iter_warc_records(self.inputs)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 554, in iter_warc_records
    for record in buffering_record_iter(ArchiveIterator(fh), post_append=True):
  File "/app/zimit/lib/python3.10/site-packages/cdxj_indexer/bufferiter.py", line 17, in buffering_record_iter
    for record in record_iter:
  File "/app/zimit/lib/python3.10/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/app/zimit/lib/python3.10/site-packages/warcio/archiveiterator.py", line 257, in _next_record
    record = self.loader.parse_record_stream(self.reader,
  File "/app/zimit/lib/python3.10/site-packages/warcio/recordloader.py", line 86, in parse_record_stream
    _detect_type_load_headers(stream,
  File "/app/zimit/lib/python3.10/site-packages/warcio/recordloader.py", line 229, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: 9Á‚c¢ Ï}¹»7É| @ˆ‚p€“Ž8|Na5B(û@Xä!”}¢ ÃX@N–2âóJ©p3@‰X“EQç¼
øM”%Ú4‹Šd¢$O5 •#*šæњumƒÈÝÈz&šÙ}•%“˜óq¸T–¿u4‘ÙøåÛ   ¹ªF(½ýæA•DƒÆ£ގ<ë.†|5·ç-@›‘V(dX%sˆ¤lgK¢’ê°Ñã)[Ðá~ê4)Iö©ŒAZ–ó¤ÅÚM·UçÎg$ݤIÏE£cWôÕLC»i3+iªÿa¤Ø±úl\¹d*éC2ÊPy•šj<Í5

SIGINT/SIGTERM received, stopping zimit

benoit74 commented 6 months ago

Hard to tell... It looks like a corrupted WARC, but the file is gone now. We should run the crawl again to confirm whether or not it is reproducible.
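
If the rerun produces WARC files again, a quick standalone check before handing them to warc2zim might help narrow down where the corruption starts. Below is a minimal sketch using only the Python standard library. It is not what warcio or warc2zim do internally. It assumes the archive is a `.warc.gz` in which each record is its own gzip member (an assumption about how the crawler writes its output), and the function name `first_bad_member_offset` is made up for illustration. It only checks that each member decompresses and begins with a `WARC/` version line; it does not validate `Content-Length` or anything else in the record.

```python
import zlib

WARC_MAGIC = b"WARC/"  # every WARC record starts with a version line like "WARC/1.1"


def first_bad_member_offset(path, chunk_size=1 << 16):
    """Walk the gzip members of a .warc.gz file.

    Returns the byte offset (in the compressed file) of the first member that
    is truncated, fails to decompress, or does not start with "WARC/".
    Returns None if every member passes this shallow check.
    """
    with open(path, "rb") as fh:
        offset = 0      # start of the current gzip member
        pending = b""   # compressed bytes read but not yet consumed
        while True:
            if not pending:
                pending = fh.read(chunk_size)
                if not pending:
                    return None  # clean end of file
            decomp = zlib.decompressobj(wbits=31)  # 31 = gzip header and trailer
            head = b""    # first few decompressed bytes of this member
            consumed = 0  # compressed bytes used by this member
            try:
                while not decomp.eof:
                    if not pending:
                        pending = fh.read(chunk_size)
                        if not pending:
                            return offset  # file ends in the middle of a member
                    out = decomp.decompress(pending)
                    if len(head) < len(WARC_MAGIC):
                        head += out[: len(WARC_MAGIC) - len(head)]
                    # unused_data is only non-empty once the member has ended
                    consumed += len(pending) - len(decomp.unused_data)
                    pending = decomp.unused_data
            except zlib.error:
                return offset  # not valid gzip data at this position
            if not head.startswith(WARC_MAGIC):
                return offset  # decompressed fine, but is not a WARC record
            offset += consumed
```

Running it over each file in the `archive` directory and printing any non-`None` result would at least tell us which file is affected and roughly where, so the broken file can be kept for inspection next time.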