openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Let's stop wget from retrying refused connections #93

Closed · rimu closed this 11 months ago

rimu commented 1 year ago

Instead, use the retrying module in the script that calls the save_large_file function.
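
For illustration, a minimal sketch of that caller-side approach, assuming zimscraperlib's `save_large_file(url, fpath)` signature and the third-party `retrying` package (the function name and retry parameters below are illustrative, not part of this PR):

```python
from pathlib import Path

from retrying import retry
from zimscraperlib.download import save_large_file


# at most 3 attempts, waiting 5s between them (values are illustrative)
@retry(stop_max_attempt_number=3, wait_fixed=5000)
def download_dump(url: str, fpath: Path) -> None:
    # save_large_file shells out to wget and raises on failure,
    # which is what triggers a retry here instead of inside wget
    save_large_file(url, fpath)


download_dump("https://example.org/big-dump.zip", Path("big-dump.zip"))
```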

rgaudin commented 1 year ago

Thank you for your contribution;

Can you elaborate on why this change is necessary?

Scraperlib is used by many scrapers, so changing how downloading happens cannot be taken lightly. The consequences of such a change must be clearly documented and, depending on them, we may need a param to toggle the behavior.
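
To illustrate what such a toggle could look like, a sketch only; the `retry_connrefused` parameter and the exact wget invocation below are hypothetical, not scraperlib's actual code:

```python
import pathlib
import subprocess


def save_large_file(url: str, fpath: pathlib.Path, retry_connrefused: bool = True) -> None:
    """Hypothetical toggle: the default keeps today's behavior; callers
    that do their own retrying could pass False so wget fails fast."""
    args = ["/usr/bin/env", "wget", "-O", str(fpath), "--continue", url]
    if retry_connrefused:
        # real wget flag: treat "connection refused" as transient and retry it
        args.insert(2, "--retry-connrefused")
    subprocess.run(args, check=True)
```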

You mention you'll be using retrying in the scraper you're working on. That's great, but other scrapers are not (yet).

rimu commented 1 year ago

This PR is an attempt to work on this issue https://github.com/openzim/gutenberg/issues/142#issuecomment-1368853023.

Because we will be using the retrying module to retry, we don't want wget to retry as well (especially on hard errors like 'connection refused'), because then the two retry counts multiply ("retrying * wget" retries): for example, 3 outer attempts around 5 wget attempts would mean up to 15 connection attempts per download. Better to handle all the retrying on the gutenberg side.

However, as you say, if many other systems depend on scraperlib behaving a certain way, it is dangerous to change it here. If this PR is not suitable then perhaps I will remove the dependency on scraperlib from gutenberg, or find another way.

Thanks.

rgaudin commented 1 year ago

With the current behavior, wget will attempt the download 4 additional times after a Connection Refused instead of giving up. Because we don't request delays, it all amounts to just a few seconds… and 4 extra requests to the server.

I thus have difficulty understanding why you don't just wrap the actual call with retrying. If you configure retrying not to retry on those errors, then you'd just lose ~5s. If you configure it to retry N times, then you'd lose N*5s.
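
As a sketch of what "configure retrying not to retry on those" could mean in practice, assuming wget failures surface as subprocess.CalledProcessError and that wget uses exit code 4 for network failures (both assumptions, not documented scraperlib behavior):

```python
import subprocess
from pathlib import Path

from retrying import retry
from zimscraperlib.download import save_large_file


def _should_retry(exc: Exception) -> bool:
    # skip retries when wget reports a network failure (exit code 4,
    # which covers refused connections); retry anything else
    return not (isinstance(exc, subprocess.CalledProcessError) and exc.returncode == 4)


@retry(retry_on_exception=_should_retry, stop_max_attempt_number=3, wait_fixed=5000)
def fetch_dump(url: str, fpath: Path) -> None:
    save_large_file(url, fpath)
```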

rimu commented 1 year ago

Yes, the current behavior is undesirable, as explained here: https://github.com/openzim/gutenberg/issues/142#issuecomment-1382942748

benoit74 commented 1 year ago

I'm not sure that the scraperlib behavior is unwanted.

What we have for now is an issue with the Gutenberg scraper behavior.

As @rgaudin said, there are 3 options.

Either we can consider this has to be fixed at a low level in scraperlib. This represents a very significant risk but would benefit all scrapers at once. So it is clearly something to consider, but we need to provide a clear understanding of the former behavior and the new one, including many tests. We cannot just add a small code modification and state that it will improve the situation without strong arguments (reading documentation correctly is never easy in my experience; I usually get it wrong at first).

Or we can add a new optional behavior (via a flag, a new method, ...) in scraperlib so that scrapers can progressively switch to the new behavior. This would reduce the risk while still providing a common implementation for everyone.

Or we can consider that it is the scrapers' responsibility to implement backoff / retrying, and hence make no change in scraperlib.

I don't know exactly what the vision is for scraperlib's scope. The main purpose is definitely to provide access to ZIM operations. I'm not sure it is really desirable to include more than that, not least because it is significant work to do.

rgaudin commented 1 year ago

> I don't know exactly what the vision is for scraperlib's scope. The main purpose is definitely to provide access to ZIM operations. I'm not sure it is really desirable to include more than that, not least because it is significant work to do.

The scope is to limit code duplication in the scrapers. Downloading being a big part of the scrapers, it makes sense to eventually include some retry helpers there.

The approach is usually to test things at the scraper level; then, once we're satisfied and have gained experience, we define how that could be integrated into scraperlib in a way that's easy to reuse from scrapers and easy to maintain as well.

At the moment, a couple of scrapers are testing backoff, but I'm not happy with it as it's not straightforward enough. I want to test retrying, for instance.
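
For reference, the scraper-level backoff pattern looks roughly like this (a sketch using the `backoff` package's decorator API; the wrapped function is just an example):

```python
import backoff
import requests


# exponential wait between attempts, give up after 5 tries
@backoff.on_exception(backoff.expo, requests.RequestException, max_tries=5)
def fetch_page(url: str) -> bytes:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.content
```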

Also, save_large_file() is kind of for special cases. It was added first because we weren't sure we could trust a simple python-requests-based downloader for large downloads; it seemed easier at first to rely on some external, rock-solid software. But most scrapers don't use it except for large dumps.

I'm not sure we want to keep it in the future. The download module is to be improved, probably by redefining the API with a better name/interface for the stream_file that we use everywhere, and by including retries.
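
Purely as a sketch of that direction (the helper name is hypothetical, and `stream_file(url, fpath=...)` is assumed to raise on network errors):

```python
from pathlib import Path

from retrying import retry
from zimscraperlib.download import stream_file


# hypothetical shared helper: every scraper would get retries for free
@retry(stop_max_attempt_number=3, wait_exponential_multiplier=1000)
def download_with_retries(url: str, fpath: Path) -> None:
    stream_file(url, fpath=fpath)
```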

stale[bot] commented 1 year ago

This pull request has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 11 months ago

@rimu @rgaudin Should we close this PR due to inactivity?

rimu commented 11 months ago

sure, ok