rimu closed this PR 11 months ago.
Thank you for your contribution!
Can you elaborate on why this change is necessary?
Scraperlib is used by many scrapers, so changing how downloading happens cannot be taken lightly. The consequences of such a change must be clearly documented and, depending on them, we may need a param to toggle the behavior.
You mention you'll be using `retrying` in the scraper you're working on. That's great, but other scrapers are not (yet).
This PR is an attempt to work on this issue https://github.com/openzim/gutenberg/issues/142#issuecomment-1368853023.
Because we will be using the `retrying` module to handle retries, we don't want wget to retry as well (especially on hard errors like "connection refused"), because then the total number of attempts would be retrying × wget. Better to handle all the retrying on the gutenberg side.
However, as you say, if many other systems depend on scraperlib behaving a certain way, it is dangerous to change it here. If this PR is not suitable then perhaps I will remove the dependency on scraperlib from gutenberg, or find another way.
Thanks.
With the current behavior, wget will attempt the download 4 additional times after a Connection Refused instead of giving up. Because we don't request delays, that amounts to just a few seconds… and 4 requests to the server.
I thus have difficulty understanding why you don't just wrap the actual call with `retrying`. If you configure retrying not to retry on those errors, then you'd only lose ~5s. If you configure it to retry N times, then you'd lose N*5s.
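To illustrate the multiplication effect being discussed, here is a minimal stdlib sketch (the helper and the fake download function are hypothetical, not scraperlib code): if the inner call itself retried 5 times like wget does, every attempt counted below would translate into 5 actual requests.

```python
import time

def retry_call(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)

calls = []

def flaky_download():
    # Stand-in for a wget invocation; if wget ALSO retried 5 times
    # internally, each entry in `calls` would mean 5 real requests.
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("connection refused")
    return "ok"

result = retry_call(flaky_download, attempts=5, exceptions=(ConnectionError,))
print(result, len(calls))  # -> ok 3
```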
Yes, the current behavior is undesirable, as explained here: https://github.com/openzim/gutenberg/issues/142#issuecomment-1382942748
I'm not sure the scraperlib behavior is unwanted. What we have for now is an issue with the Gutenberg scraper's behavior.
As @rgaudin said, there are three options.
Either we consider this has to be fixed at a low level in scraperlib. This carries a very significant risk but would benefit all scrapers at once, so it's clearly something to consider; but we need to provide a clear understanding of the former behavior and the new one, including many tests. We cannot just add a small code modification and state that it will improve the situation without solid arguments (getting documentation right is never easy in my experience; I usually get it wrong at first).
Or we add a new optional behavior (via a flag, a new method, ...) in scraperlib so that scrapers can progressively switch to it. This would reduce the risk while still providing a common implementation for everyone.
Or we consider that it is the scrapers' responsibility to implement backoff / retrying, and hence make no change in scraperlib.
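The second option could look something like this sketch (the parameter name and defaults are hypothetical, and scraperlib's actual `save_large_file` implementation may build its wget command differently); wget's own `--tries` option controls its internal retry count, so exposing it lets callers opt out:

```python
def build_wget_command(url, dest_path, tries=5):
    """Hypothetical sketch: expose wget's internal retry count so callers
    that implement their own retrying can pass tries=1 to disable it."""
    return ["wget", f"--tries={tries}", "-O", dest_path, url]

# Default: keep wget's own retries (matching current behavior).
default_cmd = build_wget_command("https://example.com/dump.zip", "dump.zip")

# A scraper doing its own retrying opts out of wget's:
no_retry_cmd = build_wget_command("https://example.com/dump.zip", "dump.zip", tries=1)
print(no_retry_cmd)
```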
I don't know what exactly the vision for scraperlib's scope is. The main goal is definitely to provide access to ZIM operations. I'm not sure it is really desirable to include much more than that, not least because it represents significant work.
The scope is to limit code duplication in the scrapers. Downloading being a big part of the scrapers, it makes sense to eventually include some retry helpers there.
The approach is usually to test things at the scraper level; then, once we're satisfied and have gained experience, we define how that could be integrated into scraperlib in a way that's easy to reuse from scrapers and easy to maintain as well.
At the moment, a couple of scrapers are testing `backoff`, but I'm not happy with it as it's not straightforward enough. I want to test `retrying`, for instance.
Also, `save_large_file()` is kind of for special cases. It was added first because we weren't sure we could trust a simple python-requests-based downloader for large downloads; it seemed easier at first to rely on some external, rock-solid software. But most scrapers don't use it except for large dumps.
I'm not sure we want to keep it in the future. The download module is to be improved, probably by redefining the API with a better name/interface for the `stream_file` function that we use everywhere, and by including retries.
This pull request has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
@rimu @rgaudin Should we close this PR for inactivity?
sure, ok
Instead, use the `retrying` module in the script that calls the `save_large_file` function.