Open benoit74 opened 6 months ago
One sample case (YouTube thumbnail/placeholder images for videos in the embedded player) where the current fuzzy rules system is insufficient: https://github.com/openzim/warc2zim/issues/262#issuecomment-2124084341
Another sample case: https://github.com/openzim/zim-requests/issues/833#issuecomment-2203184866
This issue is a placeholder for what looks like a potential enhancement warc2zim might need to implement at some point in the future (typically in a 3.x version). It is meant to summarize the current understanding of the situation and to document issues actually encountered in the wild.
Current situation (as of warc2zim 2.0)
Currently, when statically and dynamically rewriting a URL (including when computing the ZIM path of a given WARC record), the scraper applies what are called fuzzy rules. The term "fuzzy rule" comes from the wabac vocabulary.
However, the scraper currently does not really fuzzy match; it only simplifies/transforms the ZIM path (the URL has already been transformed into a ZIM path when fuzzy rules are applied).
Sample rule (in Python):
The scraper checks all fuzzy rules in the configured list. The first rule whose `pattern` matches the ZIM path currently being rewritten is used, and the ZIM path is replaced by the `replace` expression.

In static URL rewriting (Python), the rewritten ZIM path is then checked for existence within the list of expected ZIM entries; if it is missing, the URL is not rewritten.
In dynamic URL rewriting (JavaScript), we do not have the list of existing ZIM entries and hence always apply the rewriting.
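The behavior described above can be sketched as follows. This is a minimal illustration, not warc2zim's actual code: the rule content, the `example.com` path, and the function names `apply_fuzzy_rules`, `rewrite_static` and `rewrite_dynamic` are all hypothetical, and "the URL is not rewritten" is modeled here simply as returning the path unchanged.

```python
import re

# Hypothetical rule, for illustration only (not an actual warc2zim rule):
# strip the query string from a video path.
FUZZY_RULES = [
    {
        "pattern": r"^(example\.com/video/[^?]+)\?.*$",
        "replace": r"\1",
    },
]


def apply_fuzzy_rules(zim_path: str, rules=FUZZY_RULES) -> str:
    """Transform the path with the first matching rule, else return it as-is."""
    for rule in rules:
        if re.match(rule["pattern"], zim_path):
            # Only the first matching rule is applied; later rules are ignored.
            return re.sub(rule["pattern"], rule["replace"], zim_path)
    return zim_path


def rewrite_static(zim_path: str, expected_entries: set[str]) -> str:
    """Static rewriting: keep the transformed path only if the entry exists."""
    candidate = apply_fuzzy_rules(zim_path)
    # Assumption: a missing entry is modeled as "leave the path unchanged".
    return candidate if candidate in expected_entries else zim_path


def rewrite_dynamic(zim_path: str) -> str:
    """Dynamic rewriting: no entry list is available, so always transform."""
    return apply_fuzzy_rules(zim_path)
```

The only difference between the two modes in this sketch is the existence check: the static variant can fall back when the target is missing, the dynamic one cannot.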
Limitations
The problem with this approach appears when we have situations like graceful loading of image resolutions. E.g. the server has 4 image resolutions available: `image_thumb.png`, `image_low.png`, `image_med.png`, `image_high.png` (this is a simplification / illustration of what is present on YouTube and Vimeo video placeholders, as well as article images on ir.voanews.com). The HTML document contains `image_thumb.png`. Once the DOM is loaded (typically), JS fires and replaces the image `src` attribute with the proper image depending on viewport resolution (e.g. `image_low.png` on mobiles, `image_med.png` on tablets, `image_high.png` on desktop).

Since we usually scrape with a single device in Browsertrix crawler, the WARC contains only two images (usually `image_thumb.png` and `image_low.png` since we use a mobile device simulation).

But the problem is that when the user reads the ZIM file on a desktop, the JS detects a big viewport and requests `image_high.png`.
Making this work is possible only with a hack: introduce two fuzzy rules:
- a first rule matching the low-resolution image (`image_thumb.png` in our example) so that the scraper does not really rewrite this URL
- a second rule matching any image (`image_.*.png`) to rewrite it to `image_full.png` for instance

Since the scraper stops on the first matching fuzzy rule, `image_thumb.png` will stay as `image_thumb.png`, and any other image will be rewritten to `image_full.png` since it did not match the low-resolution image.

While quite simple to implement, it comes with some limitations:
- it does not work when the `thumb` suffix is also dynamic, based on viewport resolution (seen on Vimeo placeholders at least), or at least it makes the fuzzy rule very fragile, since it depends on which mobile device is used for crawling

Conclusion so far
It is now quite clear that the scraper could benefit from "real" fuzzy matching with more advanced matching rules, as expected at the very beginning of warc2zim2. It is also clear that it is not a small feature request.
As mentioned in the introduction, I do not expect anything to be implemented soon for this issue; the intent is rather to continue documenting issues encountered in the wild and the hacks implemented to cope with the situation.
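For reference, the two-rule hack described under Limitations can be sketched as below. This is only an illustration under stated assumptions: the `example.com/img/` paths and the `apply_fuzzy_rules` helper are hypothetical, and the second pattern is written slightly more strictly (`image_[^/]*\.png$`) than the `image_.*.png` shorthand used above.

```python
import re

# Order matters: the first matching rule wins.
HACK_RULES = [
    # Rule 1: match the thumbnail and "replace" it with itself,
    # so that rule 2 never sees it.
    {"pattern": r"image_thumb\.png$", "replace": "image_thumb.png"},
    # Rule 2: collapse every other resolution onto a single path.
    {"pattern": r"image_[^/]*\.png$", "replace": "image_full.png"},
]


def apply_fuzzy_rules(zim_path: str, rules) -> str:
    """Transform the path with the first matching rule, else return it as-is."""
    for rule in rules:
        if re.search(rule["pattern"], zim_path):
            return re.sub(rule["pattern"], rule["replace"], zim_path)
    return zim_path


# At scrape time (mobile simulation), the WARC holds the thumbnail and low-res image.
stored = {
    apply_fuzzy_rules(path, HACK_RULES)
    for path in ("example.com/img/image_thumb.png", "example.com/img/image_low.png")
}

# At read time on a desktop, the JS asks for the high-res image instead.
requested = apply_fuzzy_rules("example.com/img/image_high.png", HACK_RULES)
```

Because both `image_low.png` (at scrape time) and `image_high.png` (at read time) collapse to the same `image_full.png` path, the desktop request resolves to the one image that was actually captured. If the crawl had produced a differently named thumbnail, rule 1 would no longer match and the thumbnail would be collapsed too, which is the fragility noted above.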