Closed claymfischer closed 4 years ago
We can login as: ssh ec2-user@tuatara.data.humancellatlas.org
$HOME = /home/ec2-user
Some configuration notes are in: $HOME/makedoc.txt
The github tuatara repo is checked out here: /home/ec2-user/hca-tuatara
The media directory, which needs backup, is currently under that repo dir (but not checked into git): /home/ec2-user/media
(tuatara) [ec2-user@ip-172-31-24-7 hca-tuatara]$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G  2.4G   18G  12% /
We are currently using MariaDB:
MariaDB [(none)]> show global variables like 'datadir';
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| datadir | /var/lib/mysql/ |
+---------------+-----------------+
We are using InnoDB tables:
(tuatara) [ec2-user@ip-172-31-24-7 hca-tuatara]$ ls -l /var/lib/mysql
total 36912
-rw-rw---- 1 mysql mysql 16384 Sep 30 09:27 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Sep 30 09:27 aria_log_control
drwx------ 2 mysql mysql 8192 Nov 2 02:22 hcat
-rw-rw---- 1 mysql mysql 5242880 Nov 4 21:22 ib_logfile0
-rw-rw---- 1 mysql mysql 5242880 Sep 23 15:26 ib_logfile1
-rw-rw---- 1 mysql mysql 27262976 Nov 4 21:22 ibdata1
drwx------ 2 mysql mysql 4096 Sep 23 15:26 mysql
srwxrwxrwx 1 mysql mysql 0 Sep 30 17:45 mysql.sock
drwx------ 2 mysql mysql 4096 Sep 23 15:26 performance_schema
drwx------ 2 mysql mysql 6 Sep 23 15:26 test
drwx------ 2 mysql mysql 4096 Sep 24 12:14 test_
So that means the ibdata1 and ib_logfile files need backup.
OR we could have a cron job that dumps the hcat db tables as one big hcat.sql.gz dump, and then have some other cron job that swings by and picks it up.
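That cron-dump idea might look something like this (the schedule and output path are assumptions for illustration, not the actual configuration):

```shell
# Hypothetical crontab entry on the EC2 machine: hourly gzip'd dump of hcat.
# --single-transaction gives a consistent InnoDB snapshot without blocking writers.
0 * * * * mysqldump --single-transaction --databases hcat | gzip > /home/ec2-user/backups/hcat.sql.gz
```

A second cron job on the receiving machine could then fetch /home/ec2-user/backups/hcat.sql.gz over ssh.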
At a bare minimum we need that media dir and the mysql data to be backed up.
Backing up the rest of the stuff under $HOME also seems useful, since it contains some good settings.
hgsqldump --databases hcat > hcat.sql
For InnoDB tables, we can add --single-transaction. It produces a checkpoint that allows the dump to capture all data prior to the checkpoint while still receiving incoming changes; those incoming changes do not become part of the dump. That ensures the same point in time for all tables.
This means we should be able to back up a consistent snapshot of the database without blocking the running Django application.
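Putting the pieces together, the dump command with the flag added would be (a sketch; the gzip step is an assumption):

```shell
# Consistent point-in-time dump of the hcat database without blocking writers.
hgsqldump --single-transaction --databases hcat | gzip > hcat.sql.gz
```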
I spent some time looking at the different ways people back up Django sites. There are some Django backup packages, which I considered and tried.
In the end, I decided to just use cron with some simple backup scripts that I made myself. This has worked well for me on other projects in the past.
I now run an hourly backup cron job on the EC2 production machine. Other cron jobs on hgwdev, under user tuatara, copy that hourly backup from EC2 and store it on our hive storage system at UCSC. This makes hourly, daily, monthly, and yearly backups.
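A rotation like that can be done with date-keyed slots that overwrite themselves on a cycle. A minimal sketch, assuming a directory layout and function name that are illustrative rather than the actual hgwdev scripts:

```shell
#!/bin/sh
# rotate_backup SRC DEST: promote the latest dump into hourly/daily/monthly/yearly
# slots keyed by the current date, so each slot is overwritten on its natural cycle.
# (Paths and layout are assumptions for illustration.)
rotate_backup() {
    src=$1
    dest=$2
    mkdir -p "$dest/hourly" "$dest/daily" "$dest/monthly" "$dest/yearly"
    cp "$src" "$dest/hourly/hcat.$(date +%H).sql.gz"   # hour 00-23: 24-slot cycle
    cp "$src" "$dest/daily/hcat.$(date +%u).sql.gz"    # day of week 1-7: 7-slot cycle
    cp "$src" "$dest/monthly/hcat.$(date +%d).sql.gz"  # day of month: ~31-slot cycle
    cp "$src" "$dest/yearly/hcat.$(date +%m).sql.gz"   # month 01-12: 12-slot cycle
}
```

Each slot is simply overwritten the next time its date key comes around, so disk use stays bounded without any explicit pruning.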
So far the backups are just a small amount of data. We do not expect it to grow greatly. The backups provide a way to recover in case of some hardware failure or mistake on the ec2 production machine.
Because we are using InnoDB and the single-transaction feature, the database tables in the dump will be consistent.
The cron backups are running fine.
A backup strategy for this could be quite simple, perhaps even a cron job that leaves a zip file available at a web URL. We'd want to back up the database often, I imagine daily. Leaving this to @galt to consider possible solutions.