openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Use Warc-Resource-Type header to decide how to rewrite a WARC record #296

Closed benoit74 closed 2 weeks ago

benoit74 commented 3 weeks ago

Logs: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

URL: https://www.synology.com/en-br

benoit74 commented 3 weeks ago

Resource is present at https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

Failure occurs when trying to include the resource in the ZIM, considering it might have to be rewritten (HTML/JS/CSS ...).

Stacktrace is something like this (this has been reproduced locally at https://github.com/openzim/warc2zim/commit/060cbd6903665c6601ed3a6d5f7afc2b6320e831):

[warc2zim::2024-06-03 18:37:10,694] INFO:Expecting 7252 ZIM entries including redirects
[warc2zim::2024-06-03 18:37:12,041] ERROR:Problem encountered while processing https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0.
Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'wOF2\x00\x01\x00\x00\x00\x01-h\x00\r\x00\x00\x00\x02\x00\x01-\x0e\x00\x04\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00?FFTM\x1c\x1a \x06`\x00r\x11\x08\n(X\x016\x02$\x03p\x0b\x10\x00\x04 \x05\x06\x07u[R\trGa\r\':\x1a&=r*\n\x02\x19\x07nF|\x14\x08fm`$\xd8\x91@d[BQ\x11$([U<+(@P\x1e\x0e;lh\xd4\xa8y%\xdb\x81^\x14G3\x12nDp\\Yr Lt)6R"S\x0bL~CXR\x15\t4y\\[\x1ds\xe0\xbb\x8cq\x1eM%K\x17.\xdb\xba\x0e,\x0bt\'M\x1d,\x11\x15cs^.\x07\x0ch&gb\'\x0f6:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 321, in run
    self.add_items_for_warc_record(record)
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 733, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 185, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0
[warc2zim::2024-06-03 18:37:12,042] ERROR:Scraper will stop. Pass --verbose flag for more details.
Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'wOF2\x00\x01\x00\x00\x00\x01-h\x00\r\x00\x00\x00\x02\x00\x01-\x0e\x00\x04\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00?FFTM\x1c\x1a \x06`\x00r\x11\x08\n(X\x016\x02$\x03p\x0b\x10\x00\x04 \x05\x06\x07u[R\trGa\r\':\x1a&=r*\n\x02\x19\x07nF|\x14\x08fm`$\xd8\x91@d[BQ\x11$([U<+(@P\x1e\x0e;lh\xd4\xa8y%\xdb\x81^\x14G3\x12nDp\\Yr Lt)6R"S\x0bL~CXR\x15\t4y\\[\x1ds\xe0\xbb\x8cq\x1eM%K\x17.\xdb\xba\x0e,\x0bt\'M\x1d,\x11\x15cs^.\x07\x0ch&gb\'\x0f6:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/.hatch/warc2zim/bin/warc2zim", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/main.py", line 115, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 321, in run
    self.add_items_for_warc_record(record)
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 733, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 185, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

The scraper hence considered that this had to be rewritten as HTML and tried to get a decoded string from the binary content of the woff2 font ... which fails for obvious reasons.

These are the details we have about the WARC record:

### REC Headers ###
WARC/1.1
WARC-Page-ID: 593863b3-215a-4b5d-883c-e42296b62846
WARC-Resource-Type: font
WARC-JSON-Metadata: {"ipType":"Public"}
WARC-Target-URI: https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0
WARC-Date: 2024-06-03T15:09:58.603Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:bf04c2f7-2efb-4e15-ae8c-d7d5663f6cdd>
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha256:d5dbe350d2e95210ec0e04b251afb682403dbb851f7e408778fd509498511bf4
WARC-Block-Digest: sha256:16c28808ec3911005aacf07d250ca06c98bedefd2adaac5f56ba2b26f2b0859f
Content-Length: 33418

### HTTP Headers ###
HTTP/1.1 200 OK
content-type: text/html
server: nginx
last-modified: Mon, 21 Jun 2021 08:56:33 GMT
strict-transport-security: max-age=31536000; preload
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
date: Sun, 02 Jun 2024 20:30:41 GMT
etag: W/"60d05441-12d68"
vary: Accept-Encoding
x-cache: Hit from cloudfront
via: 1.1 79b38e01cf5e16de2ad2a0ec2187e7f4.cloudfront.net (CloudFront)
x-amz-cf-pop: HEL50-C2
x-amz-cf-id: GYC8i3zVgw31oKQx5PWzHPVKU_9buT1NhGGNjmZuZvpjcqPmM_f5ZA==
age: 74325

As one can see, the content-type returned by the web server is wrong: text/html is not the correct mimetype for a woff2 font.

Currently the scraper uses this mimetype (from the content-type response header) to decide if / how the WARC record needs to be rewritten: https://github.com/openzim/warc2zim/blob/060cbd6903665c6601ed3a6d5f7afc2b6320e831/src/warc2zim/content_rewriting/generic.py#L124-L150

Basing the decision only on the content-type header is obviously a tradeoff between rewriting too much (as here) and rewriting too little (not rewriting something that in fact needed it).

I propose, however, to be more resilient by taking advantage of the new WARC-Resource-Type WARC header now available, whose values come from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType ; since it describes how the browser treated the resource for its own usage, it is clearly more in line with the information we need.

I propose to alter the logic to:

This can clearly wait for 2.1, since the core problem is that the server is lying to the scraper, and such a change will need a bit of testing before we can declare that it has only the expected impact.
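
To illustrate the general idea (this is only a sketch, not the logic proposed above nor the actual warc2zim code), a decision function could prefer the browser-provided resource type and fall back to the content-type header when the type is absent or not conclusive. The function names, the mode strings and the two mapping tables are assumptions made for this example; resource type values are compared in lowercase because the record above shows `font`.

```python
from __future__ import annotations

# Hypothetical mapping from WARC-Resource-Type values to a rewrite mode.
RESOURCE_TYPE_TO_MODE = {
    "document": "html",
    "stylesheet": "css",
    "script": "javascript",
}

# Hypothetical set of resource types assumed to never need rewriting.
NO_REWRITE_TYPES = {"font", "image", "media"}

# Hypothetical mapping used by the content-type based fallback.
MIMETYPE_TO_MODE = {
    "text/html": "html",
    "text/css": "css",
    "text/javascript": "javascript",
    "application/javascript": "javascript",
}


def mode_from_content_type(content_type: str | None) -> str | None:
    """Return a rewrite mode based on the HTTP content-type header only."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mimetype = content_type.split(";", 1)[0].strip().lower()
    return MIMETYPE_TO_MODE.get(mimetype)


def choose_rewrite_mode(
    resource_type: str | None, content_type: str | None
) -> str | None:
    """Prefer the browser's resource type, fall back to the content-type.

    Returns None when the payload should be stored without rewriting.
    """
    if resource_type:
        rtype = resource_type.strip().lower()
        if rtype in RESOURCE_TYPE_TO_MODE:
            return RESOURCE_TYPE_TO_MODE[rtype]
        if rtype in NO_REWRITE_TYPES:
            return None
    # Unknown or missing resource type (e.g. "xhr", "other"): use the header.
    return mode_from_content_type(content_type)
```

With the record from this issue, `choose_rewrite_mode("font", "text/html")` gives `None`, so the woff2 payload would be stored as-is instead of being sent to the HTML rewriter.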

rgaudin commented 3 weeks ago

LGTM except we are a bit unclear on the impact, as you said.

I think it's a better approach than the current one, as there is no obligation to return a content-type, nor to return a valid one. It's only a convention, and with the professionalization of the web and the weight of tech giants, it is now mainstream.

But zimit's goal is browsing fidelity, not being a tech-spec validator, so whatever works in the browser should be the goal. In that sense, using those hints from the browser makes a lot more sense and should be preferred when available.

benoit74 commented 3 weeks ago

I just realized we could (and probably should) easily keep both approaches in parallel for 2.1: use the result from the new approach, but raise WARNINGs when the results of the two approaches differ. This will help to check for non-regression during 2.1 tests AND help to diagnose problems in production once 2.1 is released.
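
A minimal sketch of that cross-check, assuming both approaches have already produced a rewrite mode (a string such as "html", or None for "do not rewrite"). The function name, the logger name and the message format are hypothetical and not taken from warc2zim.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("rewrite_mode_check")


def pick_mode_with_cross_check(path, new_mode, legacy_mode):
    """Return the mode from the new approach, warning when the old one disagrees.

    new_mode: result of the resource-type based decision
    legacy_mode: result of the content-type based decision
    """
    if new_mode != legacy_mode:
        logger.warning(
            "Rewrite mode mismatch for %s: resource-type approach says %r, "
            "content-type approach says %r",
            path,
            new_mode,
            legacy_mode,
        )
    return new_mode
```

For the font in this issue, the call would be something like `pick_mode_with_cross_check("www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0", None, "html")`: the item is not rewritten, and one WARNING records that the content-type header would have led to HTML rewriting.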

benoit74 commented 3 weeks ago

This also caused the failure of https://farm.openzim.org/pipeline/32a2ad19-1ceb-4679-9d16-0b7d92f46c23