kiwix / container-images


Mirror stackexchange repo #182

Closed rgaudin closed 3 years ago

rgaudin commented 3 years ago

For sotoki scraper, we need to download dumps from StackExchange.

The server is backed by a few mirrors but all of them are very slow. The torrent included in this folder just refers to those mirrors as webseeds, so it's equally slow.

Relying on those mirrors would greatly slow down our sotoki scraping, even though the files only change twice a year (according to @kelson42) and total only 78GB as of this writing.

We should thus add a lightweight container to our infrastructure that would periodically (daily?) check whether the source repo has been updated and download each file. Those new files would then be served to our scraper, using a --mirror param or something.
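To make the periodic check concrete, here is a minimal sketch of how such a container could decide what to fetch. It is only an illustration under assumptions: it compares HTTP headers (ETag, Last-Modified, Content-Length) against a locally stored JSON state file, and it assumes the upstream server actually sends those headers. The function names, the state file, and the per-file comparison are all hypothetical and not part of sotoki or any existing tool.

```python
import json
import urllib.request
from pathlib import Path
from typing import Optional


def remote_fingerprint(url: str) -> dict:
    """Send a HEAD request and keep the headers that usually change with the file.

    Assumption: the upstream server exposes at least one of these headers.
    """
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return {
            "etag": resp.headers.get("ETag"),
            "last_modified": resp.headers.get("Last-Modified"),
            "size": resp.headers.get("Content-Length"),
        }


def needs_download(remote: dict, known: Optional[dict]) -> bool:
    """A file is refreshed if we never saw it or its fingerprint changed."""
    return known is None or remote != known


def files_to_refresh(remote_state: dict, state_file: Path) -> list:
    """Compare current fingerprints with the last saved ones (hypothetical JSON state)."""
    known = json.loads(state_file.read_text()) if state_file.exists() else {}
    return sorted(
        name
        for name, fingerprint in remote_state.items()
        if needs_download(fingerprint, known.get(name))
    )
```

A daily job would build `remote_state` by calling `remote_fingerprint` on each dump URL, download whatever `files_to_refresh` returns, then write the new state back to the state file. Since the dumps reportedly change only twice a year, most runs would download nothing.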

@kelson42 should we upload those to S3? As there is no versioning, we'd just be overwriting them each time.

kelson42 commented 3 years ago

Wild suggestion: maybe the right approach would be to have a "downloader" part of the scraper that works similarly to the "uploader" (without blocking everything)?

rgaudin commented 3 years ago

I get how you came to that suggestion but I'm not in favor:

ghost commented 3 years ago

This issue was moved by kelson42 to openzim/zimfarm#621.