Closed claymfischer closed 4 years ago
We can login as: ssh ec2-user@tuatara.data.humancellatlas.org
$HOME = /home/ec2-user
Some configuration notes are in: $HOME/makedoc.txt
The github tuatara repo is checked out here: /home/ec2-user/hca-tuatara
The media directory, which needs backup, is currently under that repo dir (but not checked into git): /home/ec2-user/media
(tuatara) [ec2-user@ip-172-31-24-7 hca-tuatara]$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G  2.4G   18G  12% /
We are currently using MariaDB:
MariaDB [(none)]> show global variables like 'datadir';
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| datadir | /var/lib/mysql/ |
+---------------+-----------------+
We are using InnoDB tables:
(tuatara) [ec2-user@ip-172-31-24-7 hca-tuatara]$ ls -l /var/lib/mysql
total 36912
-rw-rw---- 1 mysql mysql 16384 Sep 30 09:27 aria_log.00000001
-rw-rw---- 1 mysql mysql 52 Sep 30 09:27 aria_log_control
drwx------ 2 mysql mysql 8192 Nov 2 02:22 hcat
-rw-rw---- 1 mysql mysql 5242880 Nov 4 21:22 ib_logfile0
-rw-rw---- 1 mysql mysql 5242880 Sep 23 15:26 ib_logfile1
-rw-rw---- 1 mysql mysql 27262976 Nov 4 21:22 ibdata1
drwx------ 2 mysql mysql 4096 Sep 23 15:26 mysql
srwxrwxrwx 1 mysql mysql 0 Sep 30 17:45 mysql.sock
drwx------ 2 mysql mysql 4096 Sep 23 15:26 performance_schema
drwx------ 2 mysql mysql 6 Sep 23 15:26 test
drwx------ 2 mysql mysql 4096 Sep 24 12:14 test_
So that means the ibdata1 and ib_logfile files need backup.
OR we could have a cron job that dumps the hcat db tables as one big hcat.sql.gz dump, and then have some other cron job that swings by and picks it up.
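That cron-dump idea might look something like this (the schedule and output path are assumptions for illustration, not the actual configuration):

```shell
# Hypothetical crontab entry on the EC2 machine: hourly gzip'd dump of hcat.
# --single-transaction gives a consistent InnoDB snapshot without blocking writers.
0 * * * * mysqldump --single-transaction --databases hcat | gzip > /home/ec2-user/backups/hcat.sql.gz
```

A second cron job on the receiving machine could then fetch /home/ec2-user/backups/hcat.sql.gz over ssh.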
At a bare minimum we need that media dir and the mysql data to be backed up.
Backing up the rest of the stuff under $HOME also seems useful, since it contains some good settings.
hgsqldump --databases hcat > hcat.sql
For InnoDB tables, we can add --single-transaction. It produces a checkpoint that allows the dump to capture all data prior to the checkpoint while still receiving incoming changes; those incoming changes do not become part of the dump. That ensures the same point in time for all tables.
This means we should be able to back up a consistent snapshot of the database without blocking the running Django application.
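Putting the pieces together, the dump command with the flag added would be (a sketch; the gzip step is an assumption):

```shell
# Consistent point-in-time dump of the hcat database without blocking writers.
hgsqldump --single-transaction --databases hcat | gzip > hcat.sql.gz
```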
I spent some time looking at the different ways people back up Django sites. There are some Django backup packages, which I considered and tried.
In the end, I decided to just use cron with some simple backup scripts that I made myself. This has worked well for me on other projects in the past.
I now run an hourly backup cron job on the EC2 production machine. Other cron jobs on hgwdev, under user tuatara, copy that hourly backup from EC2 and store it on our hive storage system at UCSC. This makes hourly, daily, monthly, and yearly backups.
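A rotation like that can be done with date-keyed slots that overwrite themselves on a cycle. A minimal sketch, assuming a directory layout and function name that are illustrative rather than the actual hgwdev scripts:

```shell
#!/bin/sh
# rotate_backup SRC DEST: promote the latest dump into hourly/daily/monthly/yearly
# slots keyed by the current date, so each slot is overwritten on its natural cycle.
# (Paths and layout are assumptions for illustration.)
rotate_backup() {
    src=$1
    dest=$2
    mkdir -p "$dest/hourly" "$dest/daily" "$dest/monthly" "$dest/yearly"
    cp "$src" "$dest/hourly/hcat.$(date +%H).sql.gz"   # hour 00-23: 24-slot cycle
    cp "$src" "$dest/daily/hcat.$(date +%u).sql.gz"    # day of week 1-7: 7-slot cycle
    cp "$src" "$dest/monthly/hcat.$(date +%d).sql.gz"  # day of month: ~31-slot cycle
    cp "$src" "$dest/yearly/hcat.$(date +%m).sql.gz"   # month 01-12: 12-slot cycle
}
```

Each slot is simply overwritten the next time its date key comes around, so disk use stays bounded without any explicit pruning.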
So far the backups are just a small amount of data. We do not expect it to grow greatly. The backups provide a way to recover in case of some hardware failure or mistake on the ec2 production machine.
Because we are using InnoDB and the single-transaction feature, the database tables in the dump will be consistent.
The cron backups are running fine.
A backup strategy for this could be quite simple, perhaps even a cron job that leaves a zip file available at a web URL. We'd want to back up the database often, I imagine daily. Leaving this to @galt to consider possible solutions.