Stop relying on online archive.org Sites.xml

openzim / sotoki

StackExchange websites to ZIM scraper

https://library.kiwix.org/?category=stack_exchange

GNU General Public License v3.0


Stop relying on online archive.org Sites.xml #322

Open benoit74 opened 3 hours ago

benoit74 commented 3 hours ago

Now that archive.org is down, it is obvious that the scraper is down as well, even though we have mirrored all the dumps. The reason is a bit sad: we did not mirror Sites.xml and still fetch it from archive.org, which is no longer online.

I think we must mirror Sites.xml as well, and use the copy of Sites.xml from our S3 bucket.
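To make the idea concrete, here is a minimal sketch of "mirror first, upstream second" fetching. It is not sotoki's actual code: both URLs, `http_get` and `fetch_first_available` are hypothetical names chosen for illustration, and the mirror URL in particular is a placeholder rather than our real bucket.

```python
import urllib.request
from typing import Callable, Iterable

# Assumption: placeholder URLs for illustration, not the project's real locations.
MIRROR_SITES_XML = "https://mirror.example.org/stack_exchange/Sites.xml"
UPSTREAM_SITES_XML = "https://archive.org/download/stackexchange/Sites.xml"


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_first_available(
    urls: Iterable[str], fetch: Callable[[str], bytes] = http_get
) -> bytes:
    """Return the content of the first URL that can be fetched.

    Sources are tried in order, so listing the mirror first means the
    upstream is only contacted when the mirror fails.
    """
    errors = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as exc:  # URLError, HTTPError and timeouts are all OSError
            errors.append(f"{url}: {exc}")
    raise RuntimeError("no source reachable: " + "; ".join(errors))
```

With something like this, the scraper would call `fetch_first_available([MIRROR_SITES_XML, UPSTREAM_SITES_XML])` and keep working as long as at least one of the two is reachable.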

benoit74 commented 3 hours ago

Note: we also rely on stackoverflow being online in order to retrieve (at least):

the favicons (both normal + apple touch)
the primary and secondary CSS

While these are obviously specific to each domain, I wonder whether we should mirror them as well (and source them from S3 in the scraper), so that the scraper can still run even if StackExchange / archive.org are down.
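Since these assets are per-domain, the mirror would need a predictable layout. A rough sketch of one possible layout follows; the `assets/<domain>/<file>` scheme, the file names in `ASSET_NAMES` and the `mirror_asset_url` helper are all assumptions made up for illustration, not something that exists in the scraper or the bucket today.

```python
from urllib.parse import quote

# Assumption: hypothetical file names for the four assets listed above.
ASSET_NAMES = (
    "favicon.ico",
    "apple-touch-icon.png",
    "primary.css",
    "secondary.css",
)


def mirror_asset_url(mirror_base: str, domain: str, asset: str) -> str:
    """Build the mirror URL of a per-domain asset.

    Rejects unknown asset names so a typo fails loudly instead of
    silently producing a URL that can never exist on the mirror.
    """
    if asset not in ASSET_NAMES:
        raise ValueError(f"unknown asset: {asset}")
    # Percent-encode the domain so it is always a single path segment.
    return f"{mirror_base.rstrip('/')}/assets/{quote(domain, safe='')}/{asset}"
```

For example, `mirror_asset_url("https://mirror.example.org/", "superuser.com", "favicon.ico")` gives `https://mirror.example.org/assets/superuser.com/favicon.ico`. The scraper could try that URL first and only fall back to the live site when the mirrored copy is missing.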