Mirror stackexchange repo

rgaudin commented 3 years ago

For the sotoki scraper, we need to download dumps from StackExchange.

The server is backed by a few mirrors, but all of them are very slow. The torrent included in this folder merely references those mirrors as web seeds, so it's equally slow.

Using this will greatly slow down our sotoki scraping, even though those files only change twice a year (according to @kelson42) and total only 78GB as of this writing.

We should thus add a lightweight container to our infrastructure that would periodically (daily?) check whether the source repo has been updated and download each changed file. Those new files would then be served to our scraper, using a --mirror param or something.
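To make the idea concrete, here is a rough sketch of what that periodic check could look like. It assumes the source server answers HEAD requests with an ETag or Last-Modified header, which I have not verified for these mirrors. The function names, the state file and the `urls` mapping are all made up for illustration; nothing here exists in sotoki or our infrastructure yet.

```python
import json
import urllib.request
from pathlib import Path


def remote_fingerprint(url: str, timeout: float = 30.0) -> str:
    """Return a change marker for url using a HEAD request.

    Assumption: the server sends ETag, or at least Last-Modified
    and Content-Length. If it sends none, the marker is "|".
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        headers = response.headers
        return headers.get("ETag") or "{}|{}".format(
            headers.get("Last-Modified", ""), headers.get("Content-Length", "")
        )


def stale_files(remote: dict, known: dict) -> list:
    """Names whose remote marker is new or differs from the recorded one."""
    return sorted(name for name, mark in remote.items() if known.get(name) != mark)


def sync(urls: dict, dest: Path, state_file: Path) -> list:
    """Download every changed file into dest and record its marker.

    urls maps a file name to its source URL (hypothetical input).
    """
    known = json.loads(state_file.read_text()) if state_file.exists() else {}
    remote = {name: remote_fingerprint(url) for name, url in urls.items()}
    changed = stale_files(remote, known)
    dest.mkdir(parents=True, exist_ok=True)
    for name in changed:
        tmp = dest / (name + ".part")
        urllib.request.urlretrieve(urls[name], tmp)
        # Rename only once the download is complete, so the scraper
        # never sees a half-written file; the previous copy is overwritten.
        tmp.replace(dest / name)
        known[name] = remote[name]
        # Persist after each file so an interrupted run resumes cleanly.
        state_file.write_text(json.dumps(known, indent=2))
    return changed
```

The container would just call `sync(...)` from a daily cron job (or a sleep loop). Since only changed files are fetched, most runs should cost a handful of HEAD requests.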

@kelson42 should we upload those to S3? As there is no versioning, we'd just be overwriting them each time.

kelson42 commented 3 years ago

Wild suggestion: maybe the right approach would be to have a "downloader" part of the scraper that works similarly to the "uploader" (without blocking everything)?

rgaudin commented 3 years ago

I get how you came to that suggestion but I'm not in favor:

It complicates the scraper a bit
It doesn't solve the problem: the first run of the scraper will still be super slow… it would only benefit repeated runs with the same dump: will that even occur?
It messes up our durations on the zimfarm
It renders it all kinda unpredictable

ghost commented 3 years ago

This issue was moved by kelson42 to openzim/zimfarm#621.

kiwix / container-images

Mirror stackexchange repo #182