Open benoit74 opened 3 hours ago
Note: we also rely on Stack Overflow being online in order to retrieve (at least):
While all these are obviously specific to every domain, I wonder if we should not mirror them as well (and source them from S3 in the scraper), to be able to run the scraper again even if StackExchange / Archive.org are down.
Now that archive.org is down, it becomes obvious that the scraper is down as well, despite the fact that we have mirrored all dumps. And the reason is a bit sad: we did not mirror `Sites.xml` and we are fetching it from online ... which is not online anymore.

I think we must mirror Sites.xml as well, and use the Sites.xml from our S3 bucket.
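To make the proposal concrete, here is a minimal sketch of what "use the mirrored copy" could look like: try our own mirror first and only fall back to the upstream location if the mirror is unreachable. This is not the scraper's actual code; the two URLs and the helper names (`http_get`, `fetch_first_available`) are hypothetical placeholders.

```python
import urllib.request
from typing import Callable, Iterable

# Hypothetical locations, for illustration only (not the project's real URLs).
MIRROR_URL = "https://mirror.example.org/stackexchange/Sites.xml"
UPSTREAM_URL = "https://upstream.example.org/stackexchange/Sites.xml"


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_first_available(
    urls: Iterable[str], fetch: Callable[[str], bytes] = http_get
) -> bytes:
    """Return the content of the first URL that can be fetched."""
    errors = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
            errors.append(f"{url}: {exc}")
    raise RuntimeError("no source reachable: " + "; ".join(errors))
```

With something like this, `fetch_first_available([MIRROR_URL, UPSTREAM_URL])` would keep working while upstream is down, as long as the mirror holds a copy of the file.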