openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Add failure thresholds for missing links #273

Open benoit74 opened 4 months ago

benoit74 commented 4 months ago

Currently, warc2zim is very permissive regarding issues that may arise while rewriting documents.

This is mostly mandatory due to:

- the nature of websites encountered in the wild, which are not always well written
- the fact that many URLs have been blocked by the ad-blocker during the crawl (for good reasons, obviously)

However, warc2zim would probably benefit from a threshold mechanism to fail the scraper should issues be too numerous.

For instance, if more than xx% (10? 20?) of the links present on the homepage have failed to be rewritten, then the ZIM is most probably not usable at all. It should probably be feasible to make a distinction between missing resources (typically images, JS), which might come from an ad, and missing hyperlink targets, which are rarely from an ad, especially on a home page.

The same threshold (or another value, still to be determined) could probably be applied to other HTML pages.

Some experimentation is most probably needed to decide on the right thresholds to put in place, but warc2zim would benefit from this admittedly naive QA feature: either because it is stupid to ship a home page with many missing links, or because it is hard to detect that the crawler has been blocked at some point and many subpages are missing (e.g. 80% of the site is missing, but all links on the home page are present because they were crawled first).
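
To sketch the idea (nothing like this exists in warc2zim today; all names and threshold values below are placeholders to be tuned by experimentation), the check could be as simple as counting rewrite outcomes per document:

```python
# Purely illustrative sketch: none of these names exist in warc2zim,
# and the threshold values are placeholders.

HYPERLINK_FAILURE_THRESHOLD = 0.10  # e.g. fail if >10% of <a href> targets are missing
RESOURCE_FAILURE_THRESHOLD = 0.50   # more lenient for images/JS, which are often ad-blocked


class RewriteStats:
    """Counts rewrite outcomes for a single HTML document."""

    def __init__(self) -> None:
        self.hyperlinks_total = 0
        self.hyperlinks_missing = 0
        self.resources_total = 0
        self.resources_missing = 0

    def record_hyperlink(self, found: bool) -> None:
        self.hyperlinks_total += 1
        self.hyperlinks_missing += not found

    def record_resource(self, found: bool) -> None:
        self.resources_total += 1
        self.resources_missing += not found

    def check(self, is_homepage: bool) -> None:
        """Raise if failure ratios exceed the thresholds for this document."""
        if self.hyperlinks_total:
            ratio = self.hyperlinks_missing / self.hyperlinks_total
            if ratio > HYPERLINK_FAILURE_THRESHOLD:
                where = "home page" if is_homepage else "page"
                raise RuntimeError(
                    f"{ratio:.0%} of hyperlinks on this {where} failed to be rewritten"
                )
        if self.resources_total:
            ratio = self.resources_missing / self.resources_total
            if ratio > RESOURCE_FAILURE_THRESHOLD:
                raise RuntimeError(f"{ratio:.0%} of resources are missing")
```

Keeping a separate, more lenient threshold for resources is what would let ad-blocked assets pass while still catching a home page full of dead hyperlinks.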

rgaudin commented 4 months ago

The problem this is supposed to fix is not well defined, IMO. The title mentions “missing links”, while the first line refers to “rewriting documents”. Failure to rewrite a link and failure to rewrite a document seem quite different to me.

In general, my feeling about this kind of feature is that it looks good on paper but is unusable in practice. Users (zimfarm recipe makers) cannot be expected to guess an appropriate threshold for this, so they don't. We're left with meaningless values that are either never met or that could just as well be an internal validation check.

> the nature of websites encountered in the wild, which are not always well written

If it leads somewhere, it's a warc2zim bug. If not, we should not care, IMO.

> the fact that many URLs have been blocked by the ad-blocker during the crawl (for good reasons, obviously)

We ran the crawl, so if we decided to exclude stuff using an ad-blocker, we should handle it properly. I imagine we don't want zimfarm runs failing because some ads were not included but are still referenced in the HTML.
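
To make that concrete (purely illustrative; nothing like this exists in warc2zim), a failure counter would need to know which URLs the crawl deliberately excluded, so they never count against any threshold:

```python
# Hypothetical sketch: skip URLs that the crawler itself excluded (e.g. via
# its ad-block rules) when counting rewrite failures, so ads referenced in
# the HTML but deliberately not captured do not trip any threshold.
from urllib.parse import urlparse

# Placeholder blocklist; a real implementation would reuse the crawl's rules.
AD_HOSTS = {"ads.example.com", "tracker.example.net"}


def counts_as_failure(url: str, found: bool) -> bool:
    """A missing URL only counts as a failure if it was not excluded on purpose."""
    if found:
        return False
    return urlparse(url).hostname not in AD_HOSTS
```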

To me, QA is twofold for zimit ZIMs:


It's a wider topic because our output is a Book that represents a Content, but we get that from a recipe that's mostly a URL and some params. A website can change dramatically between runs, for legitimate reasons or not. If the content on the website is not quite the same anymore, maybe it's not a ZIM update, and maybe we don't want a ZIM of it at all.