openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Rewriting logic is trying to rewrite PDFs as HTML document #313

Closed benoit74 closed 2 weeks ago

benoit74 commented 2 weeks ago
[warc2zim::2024-06-14 03:06:43,329] WARNING:Rewrite mode has changed in 2.0.1 for static.googleusercontent.com/media/fonts.google.com/en//knowledge/stop_stealing_sheep.pdf record: was None, now is html (mimetype: application/pdf, resourcetype: document)
[warc2zim::2024-06-14 03:08:28,837] ERROR:Problem encountered while processing https://static.googleusercontent.com/media/fonts.google.com/en//knowledge/stop_stealing_sheep.pdf.
Traceback (most recent call last):
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n13570 0 obj\r<</Filter/FlateDecode/First 153/Length 1009/N 15/Type/ObjStm>>stream\r\nh\xde\xd4Umo\xe36\x0c\xfe+\x02\xf6\xa5\x87\xa1\xa1,[\xb2\r\x1c\x02\xa4I\xd3v\xd7\xa4Y\x9d\xae\xc3\x82|P\x1d%\x15\xce/\x81\xad\xdc5\xf7\xebG\xcan\xbbn\xc3\x1dn\xdb\x97A\xb0EK\x0f\xc9\x874E\x05\xa1\x94\x11\xe3,\xc0Y\xb2 \x88\xbc\xa4\x98P\xd2K1\x0bU\xe0\xa5\x84\x85q\xb7\x9b\xb2(\x0cIR\x9cE\x89\xc7'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 324, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 742, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 245, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item static.googleusercontent.com/media/fonts.google.com/en//knowledge/stop_stealing_sheep.pdf
[warc2zim::2024-06-14 03:08:28,862] DEBUG:### REC Headers ###
WARC/1.1
WARC-Page-ID: f7c4f035-633c-4e0b-9f23-f67949511f95
WARC-Resource-Type: document
WARC-JSON-Metadata: {"cert":{"issuer":"GTS CA 1C3","ctc":"0"}}
WARC-Target-URI: https://static.googleusercontent.com/media/fonts.google.com/en//knowledge/stop_stealing_sheep.pdf
WARC-Date: 2024-06-13T11:52:43.975Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:d43d5397-8803-49c7-a6df-00f5bfb35f3b>
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha256:81c08290c3fbbba0f4153611d82125e5344c7722e3c7b9d8662fad1641511743
WARC-Block-Digest: sha256:7de451f702e87f4b4b069ca16d49a75a86333548e570a52eab559c57d3d9f654
Content-Length: 25609478

### HTTP Headers ###
HTTP/1.1 200 OK
accept-ranges: bytes
access-control-allow-origin: *
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
cache-control: public, max-age=604800
content-length: 25608723
content-type: application/pdf
cross-origin-opener-policy-report-only: same-origin; report-to="apps-themes"
cross-origin-resource-policy: cross-origin
date: Thu, 13 Jun 2024 11:52:43 GMT
expires: Thu, 20 Jun 2024 11:52:43 GMT
last-modified: Thu, 20 Jul 2023 22:48:00 GMT
link: <https://fonts.google.com/knowledge/stop_stealing_sheep.pdf>; rel="canonical"
report-to: {"group":"apps-themes","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/apps-themes"}]}
server: sffe
x-content-type-options: nosniff
x-xss-protection: 0

### Content ###
Content has been stored b64-encoded at /output/fails/6a19e805-6973-4c06-86f2-93c844cc3f5e.pdf
[warc2zim::2024-06-14 03:08:28,862] ERROR:Scraper will stop. Pass --verbose flag for more details.
Traceback (most recent call last):
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n13570 0 obj\r<</Filter/FlateDecode/First 153/Length 1009/N 15/Type/ObjStm>>stream\r\nh\xde\xd4Umo\xe36\x0c\xfe+\x02\xf6\xa5\x87\xa1\xa1,[\xb2\r\x1c\x02\xa4I\xd3v\xd7\xa4Y\x9d\xae\xc3\x82|P\x1d%\x15\xce/\x81\xad\xdc5\xf7\xebG\xcan\xbbn\xc3\x1dn\xdb\x97A\xb0EK\x0f\xc9\x874E\x05\xa1\x94\x11\xe3,\xc0Y\xb2 \x88\xbc\xa4\x98P\xd2K1\x0bU\xe0\xa5\x84\x85q\xb7\x9b\xb2(\x0cIR\x9cE\x89\xc7'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 585, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 507, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 115, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 324, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 742, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 245, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item static.googleusercontent.com/media/fonts.google.com/en//knowledge/stop_stealing_sheep.pdf

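In https://github.com/openzim/warc2zim/pull/306 we assumed that every record whose WARC-Resource-Type is `document` is an HTML file. This is wrong: a `document` can also be a PDF, plain text, ... Here the record has mimetype application/pdf, so the HTML rewriter tries to decode binary PDF content as a string and fails.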
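
One possible direction, as a minimal sketch only: treat a `document` record as HTML only when its mimetype also says HTML. The function name `pick_rewrite_mode` and the `HTML_MIMETYPES` set are hypothetical and are not taken from the warc2zim code base; the real rewrite-mode selection handles more cases (CSS, JS, ...) that are left out here.

```python
from __future__ import annotations

# Hypothetical set of mimetypes we accept as HTML (an assumption, not warc2zim's list).
HTML_MIMETYPES = {"text/html", "application/xhtml+xml"}


def pick_rewrite_mode(resource_type: str | None, mimetype: str | None) -> str | None:
    """Return "html" only when resource type and mimetype both point to HTML.

    Hypothetical helper: only the `document` case from this issue is covered,
    every other combination returns None (no rewriting).
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = (mimetype or "").split(";", 1)[0].strip().lower()
    if resource_type == "document" and essence in HTML_MIMETYPES:
        return "html"
    return None


# The record from the log above: a `document` that is really a PDF.
pdf_mode = pick_rewrite_mode("document", "application/pdf")
# A regular page: a `document` served as HTML.
html_mode = pick_rewrite_mode("document", "text/html; charset=UTF-8")
```

With this check the PDF record above would not be sent to the HTML rewriter, while a page served as `text/html` still would be. Whether a missing or wrong mimetype should fall back to some other detection is left open.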