Closed: rgaudin closed this issue 3 years ago.
Wild suggestion: maybe the right approach would be to have a "downloader" part of the scraper that works similarly to the "uploader" (without blocking everything)?
I see how you came to that suggestion, but I'm not in favor:
This issue was moved by kelson42 to openzim/zimfarm#621.
For sotoki scraper, we need to download dumps from StackExchange.
The server is backed by a few mirrors, but all of them are very slow. The torrent included in this folder just refers to those mirrors as webseeds, so it's equally slow.
Relying on these mirrors would greatly slow down our sotoki scraping, even though those files only change twice a year (per @kelson42) and total only 78GB as of this writing.
We should thus add a lightweight container to our infrastructure that would periodically (daily?) check whether the source repo has been updated and download each changed file. Those files would then be served to our scraper via a `--mirror` param or something. @kelson42, should we upload those to S3? Since there is no versioning, we'd just be overwriting them each time.