monetr / monetr

monetr is a budgeting application focused on planning for recurring expenses.
https://monetr.app

feat: Builtin backup utility. #1684

Open elliotcourant opened 6 months ago

elliotcourant commented 6 months ago

It would be nice if monetr had built-in backup functionality. This would be particularly useful for self-hosted instances, where a user could simply run something like monetr backup .... The output should be a tar file that includes everything from the application.

Secrets, however, would be backed up in the form they are stored, which means that if you are using a KMS, those secrets might be lost if you lose access to that KMS (even if you have the backup).

elliotcourant commented 6 months ago

Because of how the tar file would need to be built, the entire dataset of monetr would need to fit on the disk the backup is running on, as everything would need to be written to a temporary path and then compressed.

elliotcourant commented 2 months ago

Some notes on this.

The two big things we want to back up are:

- The PostgreSQL database (via pg_dump).
- Files from whatever storage system is in use (object storage).

Tar files can be written incrementally as a stream, but we need to know the size of each file in the tar ahead of time. This will be tricky for the PostgreSQL backup, as we can't know the size of the pg_dump output until we have run it.

To fix that, we should create a pg backup wrapper which executes pg_dump and then reads from its output in chunks of X bytes. Once X bytes have been read, it creates a new tar header for that size and writes those bytes to a new file in the tarball. This will result in the single pg_dump SQL file becoming many files in the tarball, each of them a fragment of the full dump.

For example, the dump might end up in the tarball as a sequence of numbered fragment files (part 1, part 2, part 3, and so on).
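
A rough sketch of that wrapper, for illustration only; the pg_dump arguments, database name, and the "postgresql/dump.sql.NNNN" fragment naming are placeholders, not decided behaviour:

import (
    "archive/tar"
    "context"
    "fmt"
    "io"
    "os/exec"
)

// writePostgresChunks runs pg_dump and writes its stdout into the tar as a series
// of fixed-size entries; the last entry may be smaller than chunkSize.
func writePostgresChunks(ctx context.Context, tarWriter *tar.Writer, chunkSize int) error {
    cmd := exec.CommandContext(ctx, "pg_dump", "--format=plain", "monetr") // placeholder args
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return err
    }
    if err := cmd.Start(); err != nil {
        return err
    }

    buffer := make([]byte, chunkSize)
    for part := 0; ; part++ {
        // Fill the buffer completely if possible; a short read means pg_dump is done.
        bytesRead, readErr := io.ReadFull(stdout, buffer)
        if bytesRead > 0 {
            header := &tar.Header{
                Name: fmt.Sprintf("postgresql/dump.sql.%04d", part),
                Mode: 0600,
                Size: int64(bytesRead),
            }
            if err := tarWriter.WriteHeader(header); err != nil {
                return err
            }
            if _, err := tarWriter.Write(buffer[:bytesRead]); err != nil {
                return err
            }
        }
        if readErr == io.EOF || readErr == io.ErrUnexpectedEOF {
            break
        }
        if readErr != nil {
            return readErr
        }
    }

    return cmd.Wait()
}

Connection parameters would presumably come from monetr's existing PostgreSQL configuration rather than being hard-coded like this.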

Each of these fragments could be cut off in the middle of a query, so the restore has to account for that. Create an io.Reader that wraps multiple io.Readers: it reads from the first one, then the next one, and so on, allowing all of the "parts" of the PostgreSQL backup to be read individually and streamed without needing to keep them all in memory.
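
Go's standard library already provides exactly this in io.MultiReader; a minimal sketch of the restore side, assuming we can open the fragment entries in order:

// combineFragments stitches the ordered pg_dump fragments back into one logical
// SQL stream. io.MultiReader drains each reader to EOF before moving to the next,
// so a query split across a fragment boundary is reassembled transparently.
func combineFragments(fragments ...io.Reader) io.Reader {
    return io.MultiReader(fragments...)
}

The combined stream could then be fed to psql (or whatever the restore path ends up being) without buffering the whole dump in memory.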


For files from whatever storage system we are using, we will already know the file size and can simply stream each file into the tar with its exact size in the header.


We would have another goroutine that reads from the tar buffer in chunks and writes those chunks to S3 as a multipart upload. All of this is so that we do not need to keep very much in memory at all and can back up an incredibly large dataset easily (though maybe not quickly).

Examples

    // Create a pipe; the writer side feeds the tar and gzip layers, while the
    // reader side is handed to the upload goroutine below.
    reader, writer := io.Pipe()
    gzipWriter := gzip.NewWriter(writer)
    tarWriter := tar.NewWriter(gzipWriter)

This creates the pipe; each piece of the backup writes to the tar writer, which writes to the gzip writer, which writes to the pipe. This way, as we build the tar file we are continuously flushing it through compression and out the other end of the pipe.

We would then have something like this:

import (
    "bytes"
    "context"
    "io"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    "github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// uploadTarStream reads the tar+gzip stream from the pipe and uploads it to S3 as
// a multipart upload. targetObjectStoreBucket, targetObjectStoreKey, and chunkSize
// are assumed to be defined elsewhere; chunkSize must be at least 5 MiB, the S3
// minimum size for every part except the last.
func uploadTarStream(ctx context.Context, s3Client *s3.Client, reader io.Reader) error {
    // Start the multipart upload to get an upload ID for the individual parts.
    createOutput, err := s3Client.CreateMultipartUpload(ctx, &s3.CreateMultipartUploadInput{
        Bucket: aws.String(targetObjectStoreBucket),
        Key:    aws.String(targetObjectStoreKey),
    })
    if err != nil {
        return err
    }

    partNumber := int32(1)
    parts := []types.CompletedPart{}
    buffer := make([]byte, chunkSize)

    for {
        // Fill the whole buffer before uploading; short reads from the pipe would
        // otherwise produce parts below the 5 MiB minimum.
        bytesRead, err := io.ReadFull(reader, buffer)
        if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
            return err
        }
        if bytesRead == 0 {
            break
        }

        uploadOutput, err := s3Client.UploadPart(ctx, &s3.UploadPartInput{
            Bucket:     aws.String(targetObjectStoreBucket),
            Key:        aws.String(targetObjectStoreKey),
            UploadId:   createOutput.UploadId,
            PartNumber: aws.Int32(partNumber), // *int32 in recent versions of the v2 SDK.
            Body:       bytes.NewReader(buffer[:bytesRead]),
        })
        if err != nil {
            // A real implementation should also AbortMultipartUpload here.
            return err
        }

        parts = append(parts, types.CompletedPart{
            ETag:       uploadOutput.ETag,
            PartNumber: aws.Int32(partNumber),
        })

        partNumber++
    }

    _, err = s3Client.CompleteMultipartUpload(ctx, &s3.CompleteMultipartUploadInput{
        Bucket:   aws.String(targetObjectStoreBucket),
        Key:      aws.String(targetObjectStoreKey),
        UploadId: createOutput.UploadId,
        MultipartUpload: &types.CompletedMultipartUpload{
            Parts: parts,
        },
    })
    return err
}

This reads from the pipe reader in chunks and flushes each chunk to S3 as a part, so the tar is uploaded as it is being built.
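
For completeness, here is roughly how the producing and consuming halves would be wired together; buildBackupTar is a hypothetical placeholder for whatever ends up writing the PostgreSQL fragments and storage files into the tar:

reader, writer := io.Pipe()

// Produce the archive on one goroutine...
go func() {
    gzipWriter := gzip.NewWriter(writer)
    tarWriter := tar.NewWriter(gzipWriter)

    err := buildBackupTar(ctx, tarWriter) // hypothetical producer

    // Close in order: tar footer, gzip trailer, then the pipe itself. Closing the
    // pipe with the error (or nil) is what unblocks or fails the uploader on the
    // other end.
    if closeErr := tarWriter.Close(); err == nil {
        err = closeErr
    }
    if closeErr := gzipWriter.Close(); err == nil {
        err = closeErr
    }
    writer.CloseWithError(err)
}()

// ...and consume it on this one.
if err := uploadTarStream(ctx, s3Client, reader); err != nil {
    return err
}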

For writing files to the tar we would do something like this:

        header := &tar.Header{
            Name: inputObjectKey,
            Mode: 0600,
            Size: obj.Size,
        }

        if err := tarWriter.WriteHeader(header); err != nil {
            return err
        }

        if _, err := io.Copy(tarWriter, objectOutput.Body); err != nil {
            return err
        }

This takes a file from our storage system, uses its size for the tar header, and streams its contents into the tar writer. As we do this, the output is consumed by the uploader on the other end of the pipe.
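
For context, if the storage backend is S3 itself, the loop that produces obj and objectOutput above might look roughly like this; the bucket name is a placeholder and pagination is omitted:

listOutput, err := s3Client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
    Bucket: aws.String(sourceObjectStoreBucket), // placeholder bucket name
})
if err != nil {
    return err
}

for _, obj := range listOutput.Contents {
    inputObjectKey := aws.ToString(obj.Key)

    objectOutput, err := s3Client.GetObject(ctx, &s3.GetObjectInput{
        Bucket: aws.String(sourceObjectStoreBucket),
        Key:    obj.Key,
    })
    if err != nil {
        return err
    }

    header := &tar.Header{
        Name: inputObjectKey,
        Mode: 0600,
        // obj.Size is a *int64 in recent versions of the v2 SDK.
        Size: aws.ToInt64(obj.Size),
    }
    if err := tarWriter.WriteHeader(header); err != nil {
        return err
    }
    if _, err := io.Copy(tarWriter, objectOutput.Body); err != nil {
        return err
    }
    objectOutput.Body.Close()
}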

elliotcourant commented 2 months ago

More thoughts

elliotcourant commented 2 months ago

Restore options