
Initial check of URL is causing issues #256

Closed benoit74 closed 7 months ago

benoit74 commented 11 months ago

In zimit, at the beginning of scraper execution, a step named check_url is performed: https://github.com/openzim/zimit/blob/a62f31ed0dd5250e458d92c90c782ef2acfb0131/zimit.py#L467-L508

This check seems intended to validate the URL and clean it, including by following redirects.
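For reference, here is a minimal sketch of what such a check typically looks like. This is a simplified illustration, not the actual zimit code (which is linked above):

```python
import requests
from urllib.parse import urlparse


def check_url(url: str, timeout: int = 10) -> str:
    """Validate `url` and return the final URL after following redirects.

    Simplified sketch; the real implementation linked above does more
    (scope handling, user-agent configuration, port cleanup, ...).
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme in URL: {url}")

    # A plain `requests` call like this is precisely what trips anti-bot
    # protections: it does not look like a real browser to the server.
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    resp.raise_for_status()
    return resp.url
```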

It is however doing some harm: since the request is made with the Python requests library, anti-bot protections are regularly triggered.

See #255 for instance, where removing the check_url (manually, on my machine) allows Browsertrix to proceed (even if I'm not sure it will finish, protections might still stop us at some point). The same problem occurs in #232. And we have many cases reported in the weekly routine where the youzim.it task is stopped by a Python error, i.e. by something which happened in check_url.

We tried to improve the situation with https://github.com/openzim/zimit/pull/229, and while it is way better now, it is still not sufficient. Advanced anti-bot protections are not fooled by the user agent and still identify us as a bot (probably via TLS fingerprinting techniques).
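For illustration, that kind of mitigation boils down to sending a browser-like User-Agent header, roughly like this (header value illustrative only, not necessarily what #229 uses):

```python
import requests

# Sending a browser-like User-Agent header. HTTP-level headers can be
# spoofed this way, but the TLS handshake still carries the fingerprint
# of Python's ssl/urllib3 stack, which advanced protections (e.g.
# JA3-style fingerprint matching) can use to identify us as a bot.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}
resp = requests.get("https://example.com", headers=headers, timeout=10)
```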

I'm not sure how to move this forward, but clearly something has to be done.

I wonder if we should simply remove this URL check; it seems to me it is doing more harm than good, and it is the user's responsibility to input proper URLs. Do we have any notes or recollection of why exactly this was introduced?

Note that performing the check and simply ignoring the errors it returns is not sufficient, since performing the check usually triggers a temporary ban of our scraper's IP.

Another option would be to introduce a CLI flag to optionally disable this check, but I feel like this scraper already has too many flags, and on youzim.it it would be hard for end-users to know they should disable it. And if they have just run the scraper with the check enabled, the IP might already be banned and they would have to wait (without really knowing it) before re-running the scraper without the check.
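For completeness, such an opt-out could look like the sketch below; the `--skip-url-check` flag name and wiring are hypothetical, not an existing zimit option:

```python
import argparse

parser = argparse.ArgumentParser(prog="zimit")
# Hypothetical flag, not an existing zimit option: lets the user opt
# out of the initial URL check when it trips anti-bot protections.
parser.add_argument(
    "--skip-url-check",
    action="store_true",
    help="do not validate/resolve the URL before starting the crawl",
)

args = parser.parse_args(["--skip-url-check"])
if not args.skip_url_check:
    pass  # run check_url(...) here
```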

benoit74 commented 8 months ago

This might be solved by the upgrade to 1.0.0-beta5, where we will probably be able to remove the check_url operation, since redirects will now be handled by Browsertrix and the redirect target will be considered a seed (and hence not suffer from scope issues).

The only thing we should probably keep is the removal of the default 443 and 80 ports from the URL.
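A minimal sketch of that remaining bit, using the standard-library urllib.parse (this is not necessarily how the upgrade PR will implement it, and it ignores edge cases such as IPv6 hosts):

```python
from urllib.parse import urlparse, urlunparse


def strip_default_port(url: str) -> str:
    """Remove an explicit :80 (http) or :443 (https) port from `url`."""
    parsed = urlparse(url)
    default = {"http": 80, "https": 443}.get(parsed.scheme)
    if parsed.port is not None and parsed.port == default:
        # Rebuild the netloc without the redundant port, preserving
        # any credentials present in the URL.
        netloc = parsed.hostname
        if parsed.username:
            creds = parsed.username
            if parsed.password:
                creds += f":{parsed.password}"
            netloc = f"{creds}@{netloc}"
        parsed = parsed._replace(netloc=netloc)
    return urlunparse(parsed)


assert strip_default_port("https://example.com:443/path") == "https://example.com/path"
assert strip_default_port("http://example.com:8080/") == "http://example.com:8080/"
```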

To be confirmed in the PR upgrading to 1.0.0-beta5.