
Initial check of URL is causing issues #256

Closed benoit74 closed 7 months ago

benoit74 commented 11 months ago

In zimit, at the beginning of scraper execution, a step named check_url is performed: https://github.com/openzim/zimit/blob/a62f31ed0dd5250e458d92c90c782ef2acfb0131/zimit.py#L467-L508

This check seems intended to validate the URL and clean it, including by following redirects.
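For reference, here is a minimal sketch of what such a check typically looks like. This is a simplified illustration, not the actual zimit code (which is linked above):

```python
import requests
from urllib.parse import urlparse


def check_url(url: str, timeout: int = 10) -> str:
    """Validate `url` and return the final URL after following redirects.

    Simplified sketch; the real implementation linked above does more
    (scope handling, user-agent configuration, port cleanup, ...).
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme in URL: {url}")

    # A plain `requests` call like this is precisely what trips anti-bot
    # protections: it does not look like a real browser to the server.
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    resp.raise_for_status()
    return resp.url
```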

It is however doing some harm: since the request is made with the Python requests library, anti-bot protections are regularly triggered.

See #255 for instance, where removing the check_url (manually, on my machine) allows Browsertrix to proceed (even if I'm not sure it will finish, protections might still stop us at some point). The same problem occurs in #232. And we have many cases reported in the weekly routine where the youzim.it task is stopped by a Python error, i.e. by something which happened in check_url.

We tried to improve the situation with https://github.com/openzim/zimit/pull/229, and while it is way better now, it is still not sufficient. Advanced anti-bot protections are not fooled by the user agent and still identify us as a bot (probably via TLS fingerprinting techniques).
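For illustration, that kind of mitigation boils down to sending a browser-like User-Agent header, roughly like this (header value illustrative only, not necessarily what #229 uses):

```python
import requests

# Sending a browser-like User-Agent header. HTTP-level headers can be
# spoofed this way, but the TLS handshake still carries the fingerprint
# of Python's ssl/urllib3 stack, which advanced protections (e.g.
# JA3-style fingerprint matching) can use to identify us as a bot.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}
resp = requests.get("https://example.com", headers=headers, timeout=10)
```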

I'm not sure how to move this forward, but clearly something has to be done.

I wonder if we should simply remove this URL check; it seems to me it is doing more harm than good, and it is the user's responsibility to input proper URLs. Do we have any notes or recollection of why exactly this was introduced?

Note that performing the check and simply ignoring the errors it returns is not sufficient, since performing the check usually triggers a temporary ban of our scraper's IP.

Another option would be to introduce a CLI flag to optionally disable this check, but I feel like this scraper already has too many flags, and on youzim.it it would be hard for end-users to know they should disable it. And if they have just run the scraper with the check enabled, the IP might already be banned and they would have to wait (without really knowing it) before re-running the scraper without the check.
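For completeness, such an opt-out could look like the sketch below; the `--skip-url-check` flag name and wiring are hypothetical, not an existing zimit option:

```python
import argparse

parser = argparse.ArgumentParser(prog="zimit")
# Hypothetical flag, not an existing zimit option: lets the user opt
# out of the initial URL check when it trips anti-bot protections.
parser.add_argument(
    "--skip-url-check",
    action="store_true",
    help="do not validate/resolve the URL before starting the crawl",
)

args = parser.parse_args(["--skip-url-check"])
if not args.skip_url_check:
    pass  # run check_url(...) here
```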

benoit74 commented 8 months ago

This might be solved by the upgrade to 1.0.0-beta5, where we will probably be able to remove the check_url operation, since redirects will now be handled by Browsertrix and the redirect target will be considered a seed (and hence not suffer from scope issues).

The only thing we should probably keep is the removal of the default 443 and 80 ports from the URL.
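A minimal sketch of that remaining bit, using the standard-library urllib.parse (this is not necessarily how the upgrade PR will implement it, and it ignores edge cases such as IPv6 hosts):

```python
from urllib.parse import urlparse, urlunparse


def strip_default_port(url: str) -> str:
    """Remove an explicit :80 (http) or :443 (https) port from `url`."""
    parsed = urlparse(url)
    default = {"http": 80, "https": 443}.get(parsed.scheme)
    if parsed.port is not None and parsed.port == default:
        # Rebuild the netloc without the redundant port, preserving
        # any credentials present in the URL.
        netloc = parsed.hostname
        if parsed.username:
            creds = parsed.username
            if parsed.password:
                creds += f":{parsed.password}"
            netloc = f"{creds}@{netloc}"
        parsed = parsed._replace(netloc=netloc)
    return urlunparse(parsed)


assert strip_default_port("https://example.com:443/path") == "https://example.com/path"
assert strip_default_port("http://example.com:8080/") == "http://example.com:8080/"
```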

To be confirmed in the PR upgrading to 1.0.0-beta5.