offen / docker-volume-backup

Backup Docker volumes locally or to any S3, WebDAV, Azure Blob Storage, Dropbox or SSH compatible storage
https://offen.github.io/docker-volume-backup/
Mozilla Public License 2.0

Reduce storage footprint when using GPG encryption #95

Open simboel opened 2 years ago

simboel commented 2 years ago

When using GPG_PASSPHRASE, this tool will create two files in the /tmp directory:

  1. The backup-*.tar.gz file
  2. The encrypted backup-*.tar.gz.gpg file

It's not a bug, but it leads to high storage usage. This is problematic on restricted servers, for example, where storage space is expensive. Example: on one of my servers I've got 90 GiB of storage. With the current implementation, the application data itself must not exceed roughly 30 GiB, because the data plus the temporary *.tar.gz plus the encrypted *.gpg copy roughly triples the footprint, and the server would otherwise run out of space while executing the backup.

Not expected, but an idea: when using GPG_PASSPHRASE, the output of the tar step could (probably) be piped directly to gpg. That way only the *.gpg file would ever be written to storage.

m90 commented 2 years ago

> When using GPG_PASSPHRASE the result of the tar command could (probably) directly be piped to gpg

Since this is not a bash script, it's not quite that easy, but it could still be done without resorting to the intermediate file, by chaining the tarWriter and the OpenPgpWriter instead. Maybe it would even be possible to upload in a streaming manner for some backends and remove the need for an intermediate file entirely.
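On the Go side, a minimal sketch of such a chain could look like the following (assuming the golang.org/x/crypto/openpgp style SymmetricallyEncrypt API; the actual writer types and wiring in this project may differ, and addFiles is a hypothetical callback):

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"

	"golang.org/x/crypto/openpgp"
)

// streamEncryptedArchive chains tar -> gzip -> gpg so that no plaintext
// backup-*.tar.gz ever touches the disk; everything added via addFiles
// ends up encrypted in dst.
func streamEncryptedArchive(dst io.Writer, passphrase []byte, addFiles func(*tar.Writer) error) error {
	// everything written to pgpWriter is symmetrically encrypted into dst
	pgpWriter, err := openpgp.SymmetricallyEncrypt(dst, passphrase, nil, nil)
	if err != nil {
		return err
	}
	// deferred closes run in reverse order: tar, then gzip, then gpg
	defer pgpWriter.Close()

	gzipWriter := gzip.NewWriter(pgpWriter)
	defer gzipWriter.Close()

	tarWriter := tar.NewWriter(gzipWriter)
	defer tarWriter.Close()

	return addFiles(tarWriter)
}
```

dst could be the final *.gpg file handle, or, taking it further, the write end of a pipe that feeds an upload.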

This is a little complicated though (mostly because the existing code is written with the intermediate file in mind), so it's much more than a one-line change.

If anyone wants to pick this up, I am happy to review and merge in PRs. Else I might be able to look into this at some point, but I cannot really make any estimates.


A side note question: the artifacts do get deleted properly after a backup run, right? So the storage footprint is only an issue while the backup is running, correct?

simboel commented 2 years ago

Thanks for the reply. Chaining the upload would've been my next question 👍

Maybe I can find some time to start developing this feature.

m90 commented 2 years ago

On a very high level this could work like this (off the top of my head, so it might be wrong in certain details):

In case you decide to start working on this, feel free to ask questions at any time, and don't shy away from improving the pipeline if you find things that are rather odd right now.
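As a rough illustration of how those stages could hang together, assuming the storage backend can consume an io.Reader (the MinIO client's PutObject does, for example) and with createArchive and upload as hypothetical stand-ins:

```go
package main

import "io"

// streamBackup wires the archive/encrypt stage straight into the upload
// stage through an io.Pipe, so no intermediate file is written to /tmp.
func streamBackup(createArchive func(io.Writer) error, upload func(io.Reader) error) error {
	pr, pw := io.Pipe()

	go func() {
		// closing the write end (with any error) unblocks the reader side
		pw.CloseWithError(createArchive(pw))
	}()

	return upload(pr)
}
```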

m90 commented 2 years ago

I looked into how this could be implemented a little and found a tricky hidden detail: if a streaming pipeline of create tar archive -> possibly encrypt -> upload were used, users who want to stop their containers during backup might see a considerable increase in downtime, because a slow upload would now exert backpressure on the archiving step. I think it's possible to work around this mostly (emit an event once the tar writer has stopped writing, or similar), but this makes the needed refactoring even bigger.

I'll try to think about how the entire script orchestration could be refactored so it can accommodate such requirements.
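A sketch of that workaround, building on the io.Pipe wiring above; createArchive, upload and restartContainers are hypothetical helpers, and the caveat in the comment is exactly why a buffer between the stages comes up further down this thread:

```go
package main

import "io"

// backupWithEarlyRestart emits the "archiving is done" signal as soon as the
// tar/encrypt stage has finished writing, so containers can be brought back
// up while the upload is still draining the stream.
func backupWithEarlyRestart(
	createArchive func(io.Writer) error,
	upload func(io.Reader) error,
	restartContainers func() error,
) error {
	pr, pw := io.Pipe()
	restarted := make(chan error, 1)

	go func() {
		err := createArchive(pw)
		pw.CloseWithError(err)
		// Note: with a plain io.Pipe this point is only reached once the
		// uploader has consumed everything, so in practice a buffer between
		// the two stages is what makes the early restart actually pay off.
		restarted <- restartContainers()
	}()

	if err := upload(pr); err != nil {
		return err
	}
	return <-restarted
}
```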

MaxJa4 commented 1 year ago

Maybe different modes, which the user can select, would make sense here:

  1. Sequential: the current behaviour, write the archive (and its encrypted copy) to disk first, then upload
  2. Stream: pipe the archive/encryption output directly into the upload
  3. Adaptive: stream through a buffer, so archiving can finish early even if the upload is slower

Adaptive mode would solve the backpressure issue. For example, with Backblaze B2 I usually have 250 Mbit/s upload (~30 MB/s), and using zstd as the compression method (orders of magnitude faster on my machines) plus multi-core encryption, I'd assume that I/O would usually be faster than the network... but that's just a sample size of one (me). Makes sense to keep the backpressure issue in mind, IMO.

As an alternative to the suggested modes, having only adaptive and sequential could be fine too, as stream mode is just adaptive mode with the buffer set to zero or very low.
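Just to illustrate the buffered idea, a sketch of a bounded buffer that could sit between the archiving writer and the uploading reader (capacity is counted in chunks rather than bytes here, so it only approximates a byte-sized limit):

```go
package main

import "io"

// boundedBuffer decouples a fast producer (archiving) from a slower consumer
// (uploading) up to its capacity; once it is full, Write blocks and
// backpressure reaches the archiving step again.
type boundedBuffer struct {
	ch      chan []byte
	current []byte
}

func newBoundedBuffer(maxChunks int) *boundedBuffer {
	return &boundedBuffer{ch: make(chan []byte, maxChunks)}
}

// Write queues a copy of p; it only blocks once the buffer is full.
func (b *boundedBuffer) Write(p []byte) (int, error) {
	chunk := make([]byte, len(p))
	copy(chunk, p)
	b.ch <- chunk
	return len(p), nil
}

// Close signals the reading side that no more data will arrive.
func (b *boundedBuffer) Close() error {
	close(b.ch)
	return nil
}

// Read hands queued chunks to the uploader and returns io.EOF once the
// producer has closed the buffer and everything has been drained.
func (b *boundedBuffer) Read(p []byte) (int, error) {
	for len(b.current) == 0 {
		chunk, ok := <-b.ch
		if !ok {
			return 0, io.EOF
		}
		b.current = chunk
	}
	n := copy(p, b.current)
	b.current = b.current[n:]
	return n, nil
}
```

The archiving side would write into such a buffer (and containers could be restarted once it is done writing), while the uploader reads from it; a capacity of zero then behaves like the pure streaming case.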

m90 commented 7 months ago

One thing that just occurred to me is that implementing archive-encrypt-upload in a streaming fashion would be a breaking change, as it would mean the command lifecycle would change, i.e. when streaming there is no longer any possibility of running pre-archive commands or similar. It would just allow for pre and post commands. That's probably ok, but I wanted to leave it here as I just thought of it.

MaxJa4 commented 6 months ago

True, there would only be a start and a finish hook, no matter whether using streaming or buffered streaming... unless the user-defined buffer is quite large, in which case it would make sense to restart the container early and trigger a post-archive hook.

The buffer size value could be:

  1. A fixed size set by the user
  2. 0 for no buffering (pure streaming)
  3. -1 for automatic sizing (use all available space minus some offset)

I'd definitely keep the classic sequential processing around, in the form of the automatically sized stream buffer (-1 buffer size), to optionally keep the downtime of containers as low as possible without risking a storage issue.

Having all "modes" share basically one code path (just with different buffer sizes) would make the code base cleaner and configuration easier, since we wouldn't need two entirely different approaches (sequential vs. streaming).

The default option should perhaps be buffered streaming with a buffer size of -1 (automatic), since a long downtime is bad, but a full disk with a potential crash/freeze/data loss is worse IMO. That could be the best of both worlds: restart containers as soon as possible, but don't use more space than is available.

I'd need to do some testing on whether df or similar Go functions inside the container report sensible values (for me, on an amd64 Win11 desktop and an arm64 Ubuntu server, df did)... only then is the automatic buffer size feasible.
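For the Linux case, something along these lines could be used to get at the number df reports (a sketch using golang.org/x/sys/unix; Windows would need a different API):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// availableBytes reports the space available to unprivileged users on the
// filesystem containing path, roughly the "Avail" column of df.
func availableBytes(path string) (uint64, error) {
	var stat unix.Statfs_t
	if err := unix.Statfs(path, &stat); err != nil {
		return 0, err
	}
	return stat.Bavail * uint64(stat.Bsize), nil
}

func main() {
	free, err := availableBytes("/tmp")
	if err != nil {
		panic(err)
	}
	fmt.Printf("available in /tmp: %d MiB\n", free/1024/1024)
}
```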

m90 commented 6 months ago

I was thinking one could have some sort of "event" that signals the end of the archive stream, which would then trigger the restart while bytes are still being munged further down the stream, see https://github.com/offen/docker-volume-backup/issues/95#issuecomment-1118495918

Not sure how realistic that is though.

In any case I would argue optimizing for as little downtime as possible is more important than optimizing for disk space. Disk space is cheap, service downtime is not.

MaxJa4 commented 6 months ago

An event of some sort after the archiving stage is done definitely makes sense.

That's usually the case, yes. Maybe we can omit the buffer size entirely and just have buffer=true/false (working title), where true means an automatic buffer size (use all available space minus some offset) for low downtime, and false means no buffer for low space usage.

That being said, I don't know if there is any benefit to using buffer=false, since the space occupied by the automatic buffer size is freed after the backup anyway... so why not just archive ASAP with the available buffer and get the containers running again quickly? That would also mean less complexity in the implementation. If that's what you essentially meant, then I fully agree :)