openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Add support for real fuzzy matching #271

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

This issue is a placeholder for what looks like a potential enhancement warc2zim might need to implement at some point in the future (typically in a 3.x version). It is meant to summarize the current understanding of the situation and to document issues actually encountered in the wild.

Current situation (as of warc2zim 2.0)

Currently, when statically and dynamically rewriting a URL (including when computing the ZIM path of a given WARC record), the scraper applies what are called fuzzy rules. The term "fuzzy rule" comes from wabac vocabulary.

However, the scraper currently does not really fuzzy match; it only simplifies/transforms the ZIM path (the URL has already been transformed into a ZIM path when fuzzy rules are applied).

Sample rule (in Python):

  {
    "pattern": r".*googlevideo.com/(videoplayback(?=\?)).*[?&](id=[^&]+).*",
    "replace": r"youtube.fuzzy.replayweb.page/\1?\2",
  }

The scraper checks the fuzzy rules in the configured list. The first rule whose pattern matches the ZIM path currently being rewritten is used, and the ZIM path is replaced by the result of the replace expression.
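To make the first-match behaviour concrete, here is a minimal sketch in Python. It is not warc2zim's actual code: the names `FUZZY_RULES` and `apply_fuzzy_rules` are hypothetical, and the use of `re.match` plus `Match.expand` is an assumption about how such a rule could be applied. The only rule in the list is the sample rule quoted above.

```python
import re

# Hypothetical rule list; the single entry is the sample rule from this issue.
FUZZY_RULES = [
    {
        "pattern": r".*googlevideo.com/(videoplayback(?=\?)).*[?&](id=[^&]+).*",
        "replace": r"youtube.fuzzy.replayweb.page/\1?\2",
    },
]


def apply_fuzzy_rules(path, rules=FUZZY_RULES):
    """Return the path rewritten by the first matching rule, or unchanged."""
    for rule in rules:
        match = re.match(rule["pattern"], path)
        if match:
            # expand() substitutes \1, \2 with the captured groups;
            # we stop at the first rule that matches.
            return match.expand(rule["replace"])
    # No rule matched: the ZIM path is left as-is.
    return path
```

With this sketch, a path such as `rr1---sn-abc.googlevideo.com/videoplayback?expire=123&id=o-XYZ&itag=18` (a made-up example) is reduced to its `videoplayback` segment and `id` parameter, while a path matching no rule is returned untouched.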

In static URL rewriting (Python), the rewritten ZIM path is then checked for existence within the list of expected ZIM entries; if it is missing, the URL is not rewritten.

In dynamic URL rewriting (JavaScript), we do not have the list of existing ZIM entries and hence always apply the rewriting.
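The existence check in the static case could look roughly like the following sketch. The function name `rewrite_static` and the `expected_zim_entries` set are hypothetical names chosen for illustration, not identifiers from the warc2zim code base.

```python
import re


def rewrite_static(path, rules, expected_zim_entries):
    """Apply the first matching rule, but only keep the result if the
    rewritten path is among the expected ZIM entries (hypothetical helper)."""
    for rule in rules:
        match = re.match(rule["pattern"], path)
        if match:
            candidate = match.expand(rule["replace"])
            # Only the first matching rule is considered; if its result is
            # not an expected entry, the original path is kept.
            return candidate if candidate in expected_zim_entries else path
    return path
```

The dynamic (JavaScript) side would be the same logic without the membership test, which is why it always applies the rewrite.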

Limitations

The problem with this approach appears in situations like graceful loading of image resolutions. E.g. the server has four image resolutions available: image_thumb.png, image_low.png, image_med.png, image_high.png (this is a simplification / illustration of what is present on Youtube and Vimeo video placeholders, as well as article images on ir.voanews.com). The HTML document contains image_thumb.png. Once the DOM is loaded (typically), JS fires and replaces the image src attribute with the proper image depending on viewport resolution (e.g. image_low.png on mobiles, image_med.png on tablets, image_high.png on desktop).

Since we usually scrape with a single device in Browsertrix crawler, the WARC contains only two images (usually image_thumb.png and image_low.png since we use a mobile device simulation).

But the problem is that when the user reads the ZIM file on a desktop, the JS detects a big viewport and requests image_high.png, which is not in the ZIM.

Making this work is possible only with a hack: introduce two fuzzy rules:

Since the scraper stops at the first matching fuzzy rule, image_thumb.png will stay as image_thumb.png, and any other image will be rewritten to image_full.png since it did not match the low-res image.
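As an illustration of why rule order matters here, the sketch below uses two made-up rules. They are not the rules actually used in the scraper, only an assumed reconstruction of the behaviour described: the thumbnail rule comes first and maps the thumbnail to itself, and a catch-all rule maps every other resolution to image_full.png. The helper `first_match_rewrite` is hypothetical as well.

```python
import re

# Hypothetical rules, for illustration only.
HACK_RULES = [
    # Rule 1: the thumbnail matches first and is mapped to itself.
    {"pattern": r"(.*/)image_thumb\.png", "replace": r"\1image_thumb.png"},
    # Rule 2: any other resolution is mapped to one single image.
    {"pattern": r"(.*/)image_[a-z]+\.png", "replace": r"\1image_full.png"},
]


def first_match_rewrite(path, rules=HACK_RULES):
    """Rewrite the path with the first rule matching it entirely."""
    for rule in rules:
        match = re.fullmatch(rule["pattern"], path)
        if match:
            return match.expand(rule["replace"])
    return path
```

If the two rules were swapped, the catch-all would also capture image_thumb.png, and the thumbnail would no longer be served as such.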

While quite simple to implement, it comes with some limitations:

Conclusion so far

It is now quite clear that the scraper could benefit from "real" fuzzy matching with more advanced matching rules, as expected at the very beginning of warc2zim2. It is also clear that it is not a small feature request.

As mentioned in the introduction, I do not expect anything to be implemented soon on this issue. Rather, the intent is to keep documenting the issues encountered in the wild and the hacks implemented to cope with the situation.

benoit74 commented 1 month ago

One sample case (Youtube thumbnail/placeholder images for videos in the embedded player) where the current fuzzy rules system is insufficient: https://github.com/openzim/warc2zim/issues/262#issuecomment-2124084341