montera34 / pageonex

PageOneX. Analyzing front pages
http://pageonex.com
GNU Affero General Public License v3.0

clarify backup strategy #221

Open rahulbot opened 5 years ago

rahulbot commented 5 years ago

Looks like the main database is being backed up to our civic-media S3 backups folder. Are the images being backed up anywhere? Should they be?

In case of catastrophic hard drive failure, you could theoretically pull all the images from kiosko directly again with a script.

Could you regenerate all the overall images from coordinate information in the database?

Those two image folders add up to almost 200gigs, so backing it up wouldn't be trivial. Thoughts?

numeroteca commented 5 years ago

Looks like the main database is being backed up to our civic-media S3 backups folder.

Good to know. Please confirm it is there. I'd like to have a copy to run local tests.

Are the images being backed up anywhere? Should they be?

Nope, there is no backup for the images that I know of.

In case of catastrophic hard drive failure, you could theoretically pull all the images from kiosko directly again with a script. Could you regenerate all the overall images from coordinate information in the database?

For every image we have the source_url on kiosko (in addition to publication_date, image_name, size and media). The problem would be if kiosko.net disappears or shuts down.
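A minimal sketch of what such a re-download script could look like, assuming each image record has been exported as a dict with the `image_name` and `source_url` fields mentioned above, and that `image_name` is a flat file name. The function names and the `fetch` parameter are hypothetical, not part of PageOneX.

```python
import os
import urllib.request


def missing_images(records, image_dir):
    """Return the records whose image file is not present in image_dir."""
    return [
        r for r in records
        if not os.path.exists(os.path.join(image_dir, r["image_name"]))
    ]


def restore_images(records, image_dir, fetch=urllib.request.urlretrieve):
    """Download every missing image from its source_url.

    Returns the list of image names that were fetched. `fetch` is any
    callable taking (url, target_path); it is a parameter only so the
    download step can be swapped out or throttled.
    """
    os.makedirs(image_dir, exist_ok=True)
    restored = []
    for r in missing_images(records, image_dir):
        target = os.path.join(image_dir, r["image_name"])
        fetch(r["source_url"], target)
        restored.append(r["image_name"])
    return restored
```

This only works for as long as kiosko.net keeps serving the original URLs, which is exactly the risk noted above.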

Those two image folders add up to almost 200gigs, so backing it up wouldn't be trivial. Thoughts?

Are the two folders the kiosko images and the thread images? What is the estimated size of each directory?

Once in a while it would be good to have an offline backup. For /kiosko, maybe backing up only the images that are in threads with coded images would make it smaller? For /threads, not all the files are needed, as some (or all) of them are easily regenerated.

rahulbot commented 5 years ago

Verified that the database is backing up (as part of #216).

The kiosko dir right now is 176GB, the threads dir is 15GB.

What do you think the right approach is for backing up these directories? My read of the S3 pricing page suggests that:

So maybe the right approach is to have you queue up a task to write a script that tars up all the used kiosko images and thread files into a giant tarball. Then I can set up a cron job to run the script and upload the result to Amazon S3 once a month or so. How does that sound?
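A rough sketch of the tarball step, assuming the list of "used" files has already been pulled from the database as paths relative to a common base directory. The function name `build_backup` and the example paths are hypothetical; how the used files are selected is left out because it depends on the PageOneX schema.

```python
import os
import tarfile


def build_backup(archive_path, base_dir, used_paths):
    """Write a gzip tarball holding the listed files, stored relative to base_dir.

    Returns the relative paths that were listed but not found on disk,
    so the caller can log them instead of failing the whole backup.
    """
    missing = []
    with tarfile.open(archive_path, "w:gz") as tar:
        # sorted(set(...)) drops duplicates and gives a stable archive order
        for rel in sorted(set(used_paths)):
            full = os.path.join(base_dir, rel)
            if os.path.isfile(full):
                tar.add(full, arcname=rel)
            else:
                missing.append(rel)
    return missing
```

A call might look like `build_backup("/tmp/pageonex-images.tar.gz", "/data", ["kiosko/a.jpg", "threads/b.png"])`, with all three arguments being placeholders. A monthly cron entry would use a schedule such as `0 3 1 * *` (03:00 on the first day of each month).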

rporres commented 5 years ago

Taking into account that the vast majority of files won't change, you may be interested in something like rclone. It has an S3 backend.
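As an illustration of the rclone route, here is a small wrapper sketch. It assumes rclone is installed and that an S3 remote has already been set up with `rclone config`; the remote name `s3backup` and the bucket path in the usage note are made up. It uses `rclone copy` rather than `rclone sync`, because `sync` deletes files on the destination that are missing from the source, which is not what you want if the source disk has just failed.

```python
import subprocess


def rclone_copy_command(local_dir, remote, dry_run=True):
    """Build the rclone command that copies new or changed files to the remote."""
    cmd = ["rclone", "copy", local_dir, remote]
    if dry_run:
        # --dry-run makes rclone report what it would transfer without doing it
        cmd.append("--dry-run")
    return cmd


def run_backup(local_dir, remote, dry_run=True):
    """Run rclone and raise CalledProcessError if it exits non-zero."""
    subprocess.run(rclone_copy_command(local_dir, remote, dry_run), check=True)
```

For example, `run_backup("/data/kiosko", "s3backup:some-bucket/kiosko")` would do a dry run first; passing `dry_run=False` performs the actual transfer. Since unchanged files are skipped, repeated runs should only upload the newly added images.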