openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Out of scope homepage redirect #138

Open rgaudin opened 2 years ago

rgaudin commented 2 years ago

Zimit 1.x, following #76, had a mechanism to ensure that, should the passed URL redirect to an out-of-scope domain, the process would halt early, since continuing would produce a barely usable ZIM (homepage not in the ZIM).

With improvements to browsertrix-crawler, --scope has been removed in favor of a --scopeType that can be:

Note that, except for page, which matches a single URL, the others automatically include both http and https variants of matches.

There's no documentation, but here's the implementation


With this new, more complex scope mechanism, we had to remove our feature that checked whether the redirected-to homepage is out of scope, as keeping it would require us to duplicate that whole scope code in zimit. Instead, a warning is displayed if the homepage is a redirection.
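To make the trade-off concrete, here is a minimal sketch of what a fail-early check could look like if it only compared hostnames. It is an illustration under stated assumptions, not zimit's code and not browsertrix-crawler's scope logic: the function names `same_host` and `check_homepage_redirect` are hypothetical, and a host comparison covers only the simplest scope case, which is exactly why the real rules would need duplicating.

```python
from urllib.parse import urlsplit


def same_host(url_a: str, url_b: str) -> bool:
    """Compare hostnames only, ignoring scheme, port, path, and letter case."""
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    # hostname is None for strings that carry no network location
    return host_a is not None and host_a == host_b


def check_homepage_redirect(requested_url: str, final_url: str) -> str:
    """Classify a homepage redirect (hypothetical helper, simplified scope).

    "ok":   the homepage did not redirect
    "warn": it redirected but stayed on the same host
    "fail": it redirected to another host, so the homepage would
            likely be missing from the ZIM
    """
    if requested_url == final_url:
        return "ok"
    if same_host(requested_url, final_url):
        return "warn"
    return "fail"
```

Here `final_url` is assumed to come from any HTTP client that follows redirects. A scheme-only change such as http to https yields "warn", while a move to a different host yields "fail".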

Question: is that enough? Do we want a different behavior? Should we duplicate that whole scope matching logic to fail early should target homepage be out-of-scope?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Jaifroid commented 1 year ago

What are the practical consequences? That those creating Zimfarm recipes or running Zimit will have to define the scope carefully? Are we getting scrapes that are too small (or too big) as a result of this change? Is this at all related to the appearance of ZIM files in dev that are too small and contain little more than a landing page?

rgaudin commented 1 year ago

No existing recipe would be affected because they were passing with the previous check so they don't have a redirect to an out-of-scope URL.

I imagine that recipe/requests with such a redirect would complete successfully within seconds and create a tiny ZIM but we should test the scenario to be sure.

kelson42 commented 12 months ago

I have difficulty judging the level of impact of this ticket/bug/problem. Can someone help me?

rgaudin commented 12 months ago

I don't think I can be more clear than the explanation above. Maybe reading the source code would help?

https://github.com/openzim/zimit/blob/c98e4505a898434c9f36d4b8afe3fe9244879637/zimit.py#L470-L490

kelson42 commented 11 months ago

@rgaudin Sorry, my question was not specific enough. I mean the quantitative impact. Do we have a lot of scrapes impacted or only a few each year?

rgaudin commented 11 months ago

No idea, and we can't really know: this information is just a warning in the logs. I don't think it would be many, as users (at least ours) tend to copy-paste URLs from a running browser, so redirections are most likely already resolved.

benoit74 commented 5 months ago

Clearly not a 2.0 issue from my PoV, I never saw this happening in real situations.