offen / docker-volume-backup

Backup Docker volumes locally or to any S3, WebDAV, Azure Blob Storage, Dropbox or SSH compatible storage
https://offen.github.io/docker-volume-backup/
Mozilla Public License 2.0

Reduce storage footprint when using GPG encryption #95

Open simboel opened 2 years ago

simboel commented 2 years ago

When using GPG_PASSPHRASE, this tool will create two files in the /tmp directory:

  1. The backup-*.tar.gz file
  2. The encrypted backup-*.tar.gz.gpg file

It's not a bug, but it leads to high storage usage. This is problematic on restricted servers, for example, where storage space is expensive. Example: on one of my servers I've got 90 GiB of storage. With the current implementation, the application data itself must not exceed roughly 30 GiB, because the data plus the temporary *.tar.gz plus the encrypted *.gpg copy roughly triples the footprint, and the server would otherwise run out of space while executing the backup.

Not expected, but an idea: when using GPG_PASSPHRASE, the output of the tar step could (probably) be piped directly to gpg. That way only the *.gpg file would ever be written to storage.

m90 commented 2 years ago

> When using GPG_PASSPHRASE the result of the tar command could (probably) directly be piped to gpg

Since this is not a bash script, it's not quite that easy, but it could still be done without resorting to the intermediate file, by chaining the tarWriter and the OpenPgpWriter instead. Maybe it would even be possible to upload in a streaming manner for some backends and remove the need for an intermediate file entirely.
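On the Go side, a minimal sketch of such a chain could look like the following (assuming the golang.org/x/crypto/openpgp style SymmetricallyEncrypt API; the actual writer types and wiring in this project may differ, and addFiles is a hypothetical callback):

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"

	"golang.org/x/crypto/openpgp"
)

// streamEncryptedArchive chains tar -> gzip -> gpg so that no plaintext
// backup-*.tar.gz ever touches the disk; everything added via addFiles
// ends up encrypted in dst.
func streamEncryptedArchive(dst io.Writer, passphrase []byte, addFiles func(*tar.Writer) error) error {
	// everything written to pgpWriter is symmetrically encrypted into dst
	pgpWriter, err := openpgp.SymmetricallyEncrypt(dst, passphrase, nil, nil)
	if err != nil {
		return err
	}
	// deferred closes run in reverse order: tar, then gzip, then gpg
	defer pgpWriter.Close()

	gzipWriter := gzip.NewWriter(pgpWriter)
	defer gzipWriter.Close()

	tarWriter := tar.NewWriter(gzipWriter)
	defer tarWriter.Close()

	return addFiles(tarWriter)
}
```

dst could be the final *.gpg file handle, or, taking it further, the write end of a pipe that feeds an upload.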

This is a little complicated though (mostly because the existing code is written with the intermediate file in mind), so it's much more than a one-line change.

If anyone wants to pick this up, I am happy to review and merge in PRs. Else I might be able to look into this at some point, but I cannot really make any estimates.


A side note question: the artifacts do get deleted properly after a backup run, right? So the storage footprint is only an issue while the backup is running, correct?

simboel commented 2 years ago

Thanks for the reply. Chaining the upload would've been my next question 👍

Maybe I can find some time to start developing this feature.

m90 commented 2 years ago

On a very high level this could work like this (off the top of my head, so it might be wrong in certain details):

In case you decide to start working on this, feel free to ask questions at any time, and don't shy away from improving the pipeline if you find things that are rather odd right now.
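As a rough illustration of how those stages could hang together, assuming the storage backend can consume an io.Reader (the MinIO client's PutObject does, for example) and with createArchive and upload as hypothetical stand-ins:

```go
package main

import "io"

// streamBackup wires the archive/encrypt stage straight into the upload
// stage through an io.Pipe, so no intermediate file is written to /tmp.
func streamBackup(createArchive func(io.Writer) error, upload func(io.Reader) error) error {
	pr, pw := io.Pipe()

	go func() {
		// closing the write end (with any error) unblocks the reader side
		pw.CloseWithError(createArchive(pw))
	}()

	return upload(pr)
}
```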

m90 commented 2 years ago

I looked into how this could be implemented a little and found a tricky hidden detail: if a streaming pipeline of create tar archive -> possibly encrypt -> upload were used, users who want to stop their containers during backup might see a considerable increase in downtime, because a slow upload would now exert backpressure on the archiving step. I think it's possible to work around this mostly (emit an event once the tar writer has stopped writing, or similar), but this makes the needed refactoring even bigger.

I'll try to think about how the entire script orchestration could be refactored so it can accommodate such requirements.
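A sketch of that workaround, building on the io.Pipe wiring above; createArchive, upload and restartContainers are hypothetical helpers, and the caveat in the comment is exactly why a buffer between the stages comes up further down this thread:

```go
package main

import "io"

// backupWithEarlyRestart emits the "archiving is done" signal as soon as the
// tar/encrypt stage has finished writing, so containers can be brought back
// up while the upload is still draining the stream.
func backupWithEarlyRestart(
	createArchive func(io.Writer) error,
	upload func(io.Reader) error,
	restartContainers func() error,
) error {
	pr, pw := io.Pipe()
	restarted := make(chan error, 1)

	go func() {
		err := createArchive(pw)
		pw.CloseWithError(err)
		// Note: with a plain io.Pipe this point is only reached once the
		// uploader has consumed everything, so in practice a buffer between
		// the two stages is what makes the early restart actually pay off.
		restarted <- restartContainers()
	}()

	if err := upload(pr); err != nil {
		return err
	}
	return <-restarted
}
```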

MaxJa4 commented 1 year ago

Maybe different modes, which the user can select, would make sense here:

  1. Sequential: the current behaviour, write the archive (and its encrypted copy) to disk first, then upload
  2. Stream: pipe the archive/encryption output directly into the upload
  3. Adaptive: stream through a buffer, so archiving can finish early even if the upload is slower

Adaptive mode would solve the backpressure issue. For example, with Backblaze B2 I usually have 250 Mbit/s upload (~30 MB/s), and using zstd as the compression method (orders of magnitude faster on my machines) plus multi-core encryption, I'd assume that I/O would usually be faster than the network... but that's just a sample size of one (me). Makes sense to keep the backpressure issue in mind, IMO.

As an alternative to the suggested modes, having only adaptive and sequential could be fine too, as stream mode is just adaptive mode with the buffer set to zero or very low.
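Just to illustrate the buffered idea, a sketch of a bounded buffer that could sit between the archiving writer and the uploading reader (capacity is counted in chunks rather than bytes here, so it only approximates a byte-sized limit):

```go
package main

import "io"

// boundedBuffer decouples a fast producer (archiving) from a slower consumer
// (uploading) up to its capacity; once it is full, Write blocks and
// backpressure reaches the archiving step again.
type boundedBuffer struct {
	ch      chan []byte
	current []byte
}

func newBoundedBuffer(maxChunks int) *boundedBuffer {
	return &boundedBuffer{ch: make(chan []byte, maxChunks)}
}

// Write queues a copy of p; it only blocks once the buffer is full.
func (b *boundedBuffer) Write(p []byte) (int, error) {
	chunk := make([]byte, len(p))
	copy(chunk, p)
	b.ch <- chunk
	return len(p), nil
}

// Close signals the reading side that no more data will arrive.
func (b *boundedBuffer) Close() error {
	close(b.ch)
	return nil
}

// Read hands queued chunks to the uploader and returns io.EOF once the
// producer has closed the buffer and everything has been drained.
func (b *boundedBuffer) Read(p []byte) (int, error) {
	for len(b.current) == 0 {
		chunk, ok := <-b.ch
		if !ok {
			return 0, io.EOF
		}
		b.current = chunk
	}
	n := copy(p, b.current)
	b.current = b.current[n:]
	return n, nil
}
```

The archiving side would write into such a buffer (and containers could be restarted once it is done writing), while the uploader reads from it; a capacity of zero then behaves like the pure streaming case.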

m90 commented 7 months ago

One thing that just occurred to me is that implementing archive-encrypt-upload in a streaming fashion would be a breaking change, as it would mean the command lifecycle would change, i.e. when streaming there is no longer any possibility of running pre-archive commands or similar. It would just allow for pre and post commands. That's probably ok, but I wanted to leave it here as I just thought of it.

MaxJa4 commented 6 months ago

True, there would only be a start and a finish hook, no matter whether using streaming or buffered streaming... unless the user-defined buffer is quite large, in which case it would make sense to restart the container early and trigger a post-archive hook.

The buffer size value could be:

  1. A fixed size set by the user
  2. 0 for no buffering (pure streaming)
  3. -1 for automatic sizing (use all available space minus some offset)

I'd definitely keep the classic sequential processing around, in the form of the automatically sized stream buffer (-1 buffer size), to optionally keep the downtime of containers as low as possible without risking a storage issue.

Having all "modes" share basically one code path (just with different buffer sizes) would make the code base cleaner and configuration easier, since we wouldn't need two entirely different approaches (sequential vs. streaming).

The default option should perhaps be buffered streaming with a buffer size of -1 (automatic), since a long downtime is bad, but a full disk with a potential crash/freeze/data loss is worse IMO. That could be the best of both worlds: restart containers as soon as possible, but don't use more space than is available.

I'd need to do some testing on whether df or similar Go functions inside the container report sensible values (for me, on an amd64 Win11 desktop and an arm64 Ubuntu server, df did)... only then is the automatic buffer size feasible.
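For the Linux case, something along these lines could be used to get at the number df reports (a sketch using golang.org/x/sys/unix; Windows would need a different API):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// availableBytes reports the space available to unprivileged users on the
// filesystem containing path, roughly the "Avail" column of df.
func availableBytes(path string) (uint64, error) {
	var stat unix.Statfs_t
	if err := unix.Statfs(path, &stat); err != nil {
		return 0, err
	}
	return stat.Bavail * uint64(stat.Bsize), nil
}

func main() {
	free, err := availableBytes("/tmp")
	if err != nil {
		panic(err)
	}
	fmt.Printf("available in /tmp: %d MiB\n", free/1024/1024)
}
```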

m90 commented 6 months ago

I was thinking one could have some sort of "event" that signals the end of the archive stream, which would then trigger the restart while bytes are still being munged further down the stream, see https://github.com/offen/docker-volume-backup/issues/95#issuecomment-1118495918

Not sure how realistic that is though.

In any case I would argue optimizing for as little downtime as possible is more important than optimizing for disk space. Disk space is cheap, service downtime is not.

MaxJa4 commented 6 months ago

An event of some sort after the archiving stage is done definitely makes sense.

That's usually the case, yes. Maybe we can omit the buffer size entirely and just have buffer=true/false (working title), where true means an automatic buffer size (use all available space minus some offset) for low downtime, and false means no buffer for low space usage.

That being said, I don't know if there is any benefit to using buffer=false, since the space occupied by the automatic buffer size is freed after the backup anyway... so why not just archive ASAP with the available buffer and get the containers running again quickly? That would also mean less complexity in the implementation. If that's what you essentially meant, then I fully agree :)