lincbrain / linc-archive

LINC API server and Web app
https://lincbrain.org

Backups #24

Open kabilar opened 10 months ago

kabilar commented 10 months ago

Requirements

Postgres database

Data

Example Heroku command line tool usage

heroku pg:backups:capture --app linc-staging-terraform
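Beyond one-off captures, the same Heroku CLI can schedule recurring backups and retrieve existing ones. A sketch, reusing the app name from the capture example above (the schedule time is illustrative):

```shell
# Schedule an automatic daily backup at 02:00 Eastern (time is illustrative)
heroku pg:backups:schedule DATABASE_URL --at "02:00 America/New_York" --app linc-staging-terraform

# List existing backups, then download the most recent one as latest.dump
heroku pg:backups --app linc-staging-terraform
heroku pg:backups:download --app linc-staging-terraform
```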

Example Postgres pg-dump script

#!/bin/bash
set -euo pipefail

# Database credentials (password is read from ~/.pgpass or the PGPASSWORD env var)
HOST="ec2-35-168-130-158.compute-1.amazonaws.com"
PORT="5432"
USER="u8cfndphbguhq8"
DBNAME="dcq75eotjue787"

# Get all tables in the public schema
TABLES=$(psql -h "$HOST" -p "$PORT" -U "$USER" -d "$DBNAME" -t -A -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public';")

# Start with a fresh backup file, then append a dump of each table
: > backup.sql
for TABLE in $TABLES; do
    pg_dump -h "$HOST" -p "$PORT" -U "$USER" -t "$TABLE" "$DBNAME" >> backup.sql
done
kabilar commented 10 months ago

Let's see if there are alternatives within AWS. Possibly duplicate the bucket into cold storage?

aaronkanzer commented 9 months ago

@kabilar -- leaving some notes / research here, feel free to edit, append, etc.

Postgres:

Roni mentioned that we get continuous backups (purely via the Postgres instance type we are using): https://devcenter.heroku.com/articles/heroku-postgres-data-safety-and-continuous-protection (for reference, we are using the standard-0 Postgres instance type; code reference here from dandi-infrastructure)

All Heroku Postgres databases are protected through continuous physical backups. These backups are stored in the same region as the database and retrieved through [Heroku Postgres Rollbacks](https://devcenter.heroku.com/articles/heroku-postgres-rollback) on Standard-tier or higher databases

We could go one step further and perform Heroku "logical" backups: https://devcenter.heroku.com/articles/heroku-postgres-logical-backups

This is pretty much Heroku's CLI tool making pg_dump commands more user-friendly. Depending on our preferences, we could invoke this command if desired -- it doesn't seem too costly, just another thing to maintain. I'm not sure there is much to gain here other than being vendor-agnostic with the pg_dump output -- e.g. we could apply the backup files to an AWS-hosted Postgres if one day we wanted to move away from Heroku.
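To illustrate the vendor-agnostic point: a downloaded logical backup can be restored into any Postgres instance with standard tooling. A sketch (the target host, user, and database are hypothetical):

```shell
# Download the most recent logical backup as latest.dump
heroku pg:backups:download --app linc-staging-terraform

# Restore it into any Postgres server, e.g. an RDS instance
# (target host, user, and database name are hypothetical)
pg_restore --verbose --clean --no-acl --no-owner \
    -h target-host.example.com -U target_user -d target_db latest.dump
```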

S3 Data:

S3 has a few mechanisms that we should use for data integrity & prevention of data loss.

Enable S3 Versioning -- this is simply a property that we switch "on" for the bucket. Version IDs then let us retrieve prior states of an object. I still need to check on the long-term pricing implications...
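Enabling versioning is a one-time bucket setting via the AWS CLI; a sketch (the bucket name and object key are hypothetical):

```shell
# Turn on versioning for the bucket (one-time setting; bucket name hypothetical)
aws s3api put-bucket-versioning --bucket linc-archive-bucket \
    --versioning-configuration Status=Enabled

# Inspect the version history of an object...
aws s3api list-object-versions --bucket linc-archive-bucket --prefix some/key

# ...and retrieve an older version of it by version ID
aws s3api get-object --bucket linc-archive-bucket --key some/key \
    --version-id <version-id> restored-file
```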

Object Lock -- this creates a potentially indefinite time in which an S3 object cannot be deleted or overwritten: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html

MFA Deletion -- we should definitely use this on top of proper IAM permissions (especially when it comes to any mechanism that could delete the entire bucket -- spooky!)

I'm going to evaluate the Terraform code we have to configure our S3 bucket. I need to read more of the codebase, but ideally these things are already present/obtainable. Worst case scenario, we go into the AWS management console directly (e.g. circumvent any infrastructure-as-code) and edit in the short-term as a safeguard.

Assets / datalad / git-annex

We could implement this as well via https://github.com/dandi/backups2datalad -- started some initial conversations -- Yarik, John, Dartmouth seem to be the point persons: https://dandiarchive.slack.com/archives/GMRLT5RQ8/p1706046120749959

I think we should evaluate what utility we gain from datalad versus S3 alone. I'm assuming the utility of datalad/git-annex is the ability to save only the portions of assets that have changed rather than the entire asset; however, I'm curious whether the requirements of LINC make S3 storage sufficient (especially as datalad would be another "service" to technically manage, though perhaps it is a set-it-and-forget-it type of service we can integrate) -- can discuss further

aaronkanzer commented 9 months ago

@kabilar (ignore this for now if you see it, enjoy the wedding! šŸ˜„ ) -- just noting so I don't forget here...

I did a proof-of-concept yesterday for setting up a cron job that performs manual pg_dump of our Heroku PostgresDB into an S3 bucket...and it works well!

See here:

https://github.com/lincbrain/linc-archive/blob/master/.github/workflows/pg-backup.yml
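The core step of such a workflow boils down to something like the following (the bucket name is a placeholder, and DATABASE_URL is assumed to come from the Heroku app config):

```shell
# Dump the database, compress, and upload with a timestamped key
# (DATABASE_URL from Heroku app config; bucket name is a placeholder)
TIMESTAMP=$(date +%Y-%m-%d-%H%M%S)
pg_dump "$DATABASE_URL" | gzip > "backup-$TIMESTAMP.sql.gz"
aws s3 cp "backup-$TIMESTAMP.sql.gz" "s3://linc-backups/postgres/backup-$TIMESTAMP.sql.gz"
```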

Can refine a bit more moving forward of course -- GPT made this a quick win for sure.

aaronkanzer commented 9 months ago

@kabilar

A bit more research/experimentation on the Object Lock and 2FA Deletion features in AWS:

Object Lock

While we were able to successfully use Terraform in our sandbox example to enable Object Lock (https://github.com/lincbrain/linc-sandbox/blob/main/aws/bucket.tf), providing a similar flag in the pre-existing dandi-infrastructure failed somewhat silently (it seems Terraform tries to re-create the bucket). However, all buckets already have the lifecycle {prevent_destroy = true} rule, which essentially does what Object Lock would, but at the parent bucket level.

(Just as an FYI, the main use case of lifecycle {prevent_destroy = true} is to block terraform destroy -- i.e. what is invoked when you remove a resource from a TF file -- on a given resource.)

Nevertheless, we can still go into the AWS Management Console for the bucket and turn the feature on. The conclusion here is that this is a Terraform issue, not an AWS issue -- my recommendation would be to document the manual steps and turn the feature on for the relevant buckets as a good safeguard.

There are two retention modes, GOVERNANCE and COMPLIANCE -- my suggestion would be to use GOVERNANCE for the LINC project, as COMPLIANCE removes any ability to delete data, whereas GOVERNANCE limits deletion to users with a specific bypass permission.
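With Object Lock enabled on the bucket, a default GOVERNANCE retention could be set via the CLI along these lines (the bucket name and retention window are illustrative):

```shell
# Apply a default GOVERNANCE-mode retention of 30 days to new objects
# (bucket name and retention window are illustrative)
aws s3api put-object-lock-configuration --bucket linc-archive-bucket \
    --object-lock-configuration \
    '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 30}}}'
```

Note that Object Lock requires versioning to be enabled on the bucket.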

2FA Deletion

AWS is like a fortress setting this up šŸ˜‚ -- only the root user can enable it (not even users with AdministratorAccess in AWS IAM). I messaged Satra about finding the root user/access.

MFA can only be enabled via the AWS CLI, with a command as such:

aws s3api put-bucket-versioning --bucket <bucket-name> --versioning-configuration Status=Enabled,MFADelete=Enabled --mfa "arn:aws:iam::151312473579:mfa/<user>-iphone <code-from-DUO>"

The gain here is protection in case any of our credentials are ever leaked, which I think is reason enough to set it up. This should be straightforward once we get the root user account info.

As an aside, we should explore S3 Intelligent-Tiering for moving data from hot storage to Glacier and Deep Archive. I'm not sure how one-to-one DANDI's garbage collection is with the LINC project, so Intelligent-Tiering could be quite useful for us.
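For reference, opting a bucket's objects into the Intelligent-Tiering archive tiers can be done with a configuration like this (the bucket name, config ID, and day thresholds are illustrative):

```shell
# Move objects not accessed for 90/180 days into the Archive / Deep Archive
# access tiers (bucket name, config ID, and thresholds are illustrative)
aws s3api put-bucket-intelligent-tiering-configuration \
    --bucket linc-archive-bucket --id archive-tiering \
    --intelligent-tiering-configuration \
    '{"Id": "archive-tiering", "Status": "Enabled", "Tierings": [{"Days": 90, "AccessTier": "ARCHIVE_ACCESS"}, {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}]}'
```

This only applies to objects stored in the INTELLIGENT_TIERING storage class (e.g. uploaded with `--storage-class INTELLIGENT_TIERING` or transitioned there via a lifecycle rule).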

Let me know if you have any questions/concerns in the meantime.

aaronkanzer commented 9 months ago

Cc @satra -- just wanted to notify you of this GitHub issue, as it relates to some of the questions you've had regarding resiliency of the DANDI/LINC infrastructure, especially when it comes to data integrity and preservation in a worst-case scenario, such as Terraform accidentally de-provisioning some infrastructure, a malicious actor getting credentials, overwrites, etc.

@kabilar and I will soon consolidate these thoughts into an organized design doc (that can perhaps extend some of these proof-of-concepts into DANDI), but just wanted you to be aware in the meantime.