Open rgaudin opened 2 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
What are the practical consequences? Will those creating Zimfarm recipes or running Zimit have to define the scope carefully? Are we getting scrapes that are too small (or too big) as a result of this change? Is this at all related to the appearance of ZIM files in dev that are too small and have little more than a landing page?
No existing recipe would be affected because they were passing with the previous check so they don't have a redirect to an out-of-scope URL.
I imagine that recipes/requests with such a redirect would complete successfully within seconds and create a tiny ZIM, but we should test the scenario to be sure.
I have difficulty judging the level of impact of this ticket/bug/problem. Can someone help me?
I don't think I can be more clear than the explanation above. Maybe reading the source code would help?
https://github.com/openzim/zimit/blob/c98e4505a898434c9f36d4b8afe3fe9244879637/zimit.py#L470-L490
@rgaudin Sorry, my question was not specific enough. I mean the quantitative impact. Do we have a lot of scrapes impacted or only a few each year?
No idea, and we can't really know: this information is just a warning in the logs. I don't think it would be many, as users (at least ours) tend to copy-paste URLs from a running browser, so redirections are most likely already resolved.
Clearly not a 2.0 issue from my PoV; I have never seen this happen in real situations.
Zimit 1.x, following #76, had a mechanism to ensure that, should the passed URL redirect to an out-of-scope domain, the process would halt early, as it would otherwise result in a barely usable ZIM (homepage not in ZIM).
With improvements to browsertrix-crawler, `--scope` has been removed in favor of a `--scopeType` that can be:

- `page`: single URL
- `page-spa`: same as `page`, plus any fragment link to that URL
- `prefix` (default): any URL that shares the same prefix up to the last `/`
- `host`: any URL that shares the same prefix up to the first `/`
- `domain`: any URL on the same domain or on any subdomain (matched against the non-`www.` domain if `www.` was present). ⚠️ uses the URL's port on every domain.
- `any`: anything
- `custom`: uses `--include` and `--exclude` (regexp)

Note that except for `page`, which is a single URL, the others automatically include both `http` and `https` variants of matches.

There's no documentation, but here's the implementation
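
To make the scope types listed above concrete, here is a minimal Python sketch of how they might be approximated. It is an illustration of the description in this issue only, not the browsertrix-crawler implementation: the `in_scope` helper, its parameters, the use of `re.search` for `custom`, and the port comparison for `domain` are all assumptions.

```python
import re
from urllib.parse import urldefrag, urlsplit


def _host_and_path(url):
    """Return host + path without the scheme, so http/https compare equal."""
    parts = urlsplit(url)
    return parts.netloc.lower() + (parts.path or "/")


def in_scope(seed, candidate, scope_type="prefix", include=(), exclude=()):
    """Hypothetical helper approximating the scope types described above."""
    if scope_type == "any":
        return True
    if scope_type == "page":
        # Single URL: exact match only.
        return candidate == seed
    if scope_type == "page-spa":
        # Same page, but any fragment (#...) is accepted.
        return urldefrag(candidate).url == urldefrag(seed).url
    if scope_type == "custom":
        # Assumption: a URL must match an include regexp and no exclude regexp.
        return (any(re.search(p, candidate) for p in include)
                and not any(re.search(p, candidate) for p in exclude))

    seed_parts, cand_parts = urlsplit(seed), urlsplit(candidate)
    if scope_type == "prefix":
        # Keep everything up to (and including) the last "/" of the seed.
        seed_hp = _host_and_path(seed)
        prefix = seed_hp[: seed_hp.rfind("/") + 1]
        return _host_and_path(candidate).startswith(prefix)
    if scope_type == "host":
        # Same host (and port), any path.
        return cand_parts.netloc.lower() == seed_parts.netloc.lower()
    if scope_type == "domain":
        # Same domain or any subdomain, ignoring a leading "www." on the seed.
        domain = (seed_parts.hostname or "").removeprefix("www.")
        host = cand_parts.hostname or ""
        same_domain = host == domain or host.endswith("." + domain)
        # Assumption: the seed's explicit port must match on every domain.
        return same_domain and cand_parts.port == seed_parts.port
    raise ValueError(f"unknown scope type: {scope_type}")
```

For example, with the seed `https://example.com/docs/index.html`, `http://example.com/docs/page2.html` is in scope for `prefix`, while `https://example.com/blog/` is only in scope from `host` upwards.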
With this new, complex scope mechanism, we had to remove our feature that checked if the redirected-to homepage is out-of-scope as it would require us to duplicate that whole scope code in zimit. Instead, a warning is displayed if the homepage is a redirection.
Question: is that enough? Do we want a different behavior? Should we duplicate that whole scope-matching logic so that we fail early if the target homepage is out of scope?
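
For reference, the warning-only behavior described above could look roughly like the following Python sketch. It is not the code in `zimit.py`: the `check_homepage_redirect` function and the logger name are hypothetical, and the final URL is assumed to have been obtained beforehand from an HTTP client that follows redirects, so the sketch itself performs no network requests.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("zimit-sketch")  # hypothetical logger name


def check_homepage_redirect(requested, final):
    """Warn, without failing, when the homepage is a redirection.

    `final` is assumed to be the URL reached after following redirects.
    Returns True when a redirection was detected, False otherwise.
    """
    if final == requested:
        return False

    if urlsplit(requested).hostname != urlsplit(final).hostname:
        # A different host is the case most likely to be out of scope.
        logger.warning(
            "%s redirects to another host (%s); the homepage may be out of scope",
            requested, final,
        )
    else:
        logger.warning("%s redirects to %s", requested, final)
    return True
```

Because the function only warns, a crawl whose homepage redirects out of scope would still start; failing early would require an actual scope check on `final` at this point.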