sabeechen / hassio-google-drive-backup

Automatically create and sync Home Assistant backups into Google Drive

Home Assistant snapshots without temporary files #421

Open srdjanrosic opened 3 years ago

srdjanrosic commented 3 years ago

feature request

Problem:

We have large, relatively slow, and flaky microSD cards. Currently, the addon invokes Home Assistant snapshot creation on local media, triggering file reading and writing in order to create the compressed archive. On a microSD card this can take a while. The archive is then read again in order to upload it to Google Drive. With nightly backups, most of these daily snapshots get deleted without ever being used. In short: don't write the temporary file at all.

Proposal:

Have the supervisor write into a pipe. Have the addon read from a pipe and upload to drive.

This can be a named pipe on the filesystem, like the one you can create on the shell with mkfifo(1) in the same directory as snapshots... (I'm guessing that's the simplest option). We can communicate pipe (file?)names over the existing snapshot creation api.
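For illustration, a minimal sketch of the named-pipe idea (the path and function names are mine, not anything that exists today; the real pipe name would be agreed on via the snapshot API):

```python
import os
import shutil

# Hypothetical pipe path; the real name would be communicated via the snapshot API.
PIPE_PATH = "/backup/snapshot_stream.fifo"

# Supervisor side: create the FIFO and write the archive bytes into it.
def supervisor_write(archive_chunks):
    if not os.path.exists(PIPE_PATH):
        os.mkfifo(PIPE_PATH)
    with open(PIPE_PATH, "wb") as pipe:  # open() blocks until a reader attaches
        for chunk in archive_chunks:
            pipe.write(chunk)

# Addon side (separate process): read from the FIFO and feed it to the uploader.
def addon_read(upload_fileobj):
    with open(PIPE_PATH, "rb") as pipe:
        shutil.copyfileobj(pipe, upload_fileobj)
```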


I'd expect that not having to write the data to the SD card (only reading it instead) would make snapshot creation more than 2x quicker and would also extend the life of already not-so-long-lasting SD cards - this would be great all around.

... but it requires changes in both Home Assistant and this addon to actually make it work. I'm filing a feature request here, since the Home Assistant changes would be useless without code here that can use the feature.

... yay/nay/maybe? good idea? thoughts?

if yay, where would we go from here?


srdjanrosic commented 3 years ago

I actually spent some time trying to understand the code.

The way I read it, Home Assistant Core calls the Supervisor using HTTP over an IP socket, passing parameters after some lightweight validation, and the request ends up in this class, which handles archiving and unarchiving in a mostly self-contained way:

https://github.com/home-assistant/supervisor/blob/564e9811d0f594d95152f298514e2a1942dbad67/supervisor/snapshots/snapshot.py#L66

The archiving mechanism itself is very simplistic and really bad from an I/O-performance and wear-and-tear perspective. Creating a snapshot ends up writing the data twice (not once, as I originally thought): first it creates a temporary directory with all the addons and folders, then it archives that directory (a second write of the same data), and then this addon uploads it.

The final archiving is done using Python's built-in tarfile.TarFile in "w|gz" mode into a fileobj: https://github.com/home-assistant/supervisor/blob/564e9811d0f594d95152f298514e2a1942dbad67/supervisor/utils/tar.py#L82

...or, here, into an actual filename when the snapshot isn't encrypted: https://github.com/home-assistant/supervisor/blob/564e9811d0f594d95152f298514e2a1942dbad67/supervisor/utils/tar.py#L54
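To make the "w|gz" streaming mode concrete, here's a minimal standalone example (not the supervisor's code): in this mode tarfile writes the archive sequentially and never seeks in the target file object, so a pipe would work just as well as a regular file.

```python
import tarfile

def write_archive(fileobj, paths):
    # "w|gz" is tarfile's streaming gzip mode: members are written sequentially
    # and the target file object is never seeked, so a pipe or socket works too.
    with tarfile.open(fileobj=fileobj, mode="w|gz") as tar:
        for path in paths:
            tar.add(path)

# Example: stream the archive into a regular file (could equally be a FIFO).
with open("/tmp/example.tar.gz", "wb") as f:
    write_archive(f, ["/etc/hostname"])
```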

I guess it's kind of ok.

It's looking like:

  1. Do some light cleanup and refactoring.
  2. Plumb a TarFile-compatible fileobj through SecureTarFile and up the stack.
  3. Figure out how to stream/write the tar file contents. Since tar requires each archive member's size to be known in advance, we might need to split large member files that don't fit into RAM/buffers - something we can only determine after the first buffer rolls over.
  4. Figure out how to correctly send the data to this addon. It occurred to me while making coffee earlier that plain filesystem FIFO pipes have no concept of communicating done-ness, so the reader could hang forever. To mitigate that, we could use a trivial protocol that writes the number of bytes, then the bytes themselves, and finishes with a zero at the end (see the sketch after this list). It doesn't have to be super smart; if we need a smarter protocol later, we can request it when asking for a snapshot. As for cancellation, the size of the next block could be sent as -1.
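A rough sketch of what such a framing protocol could look like (the 8-byte signed length prefix and all the names are just my placeholders):

```python
import struct

# Each block: an 8-byte signed big-endian length, then that many bytes.
# A length of 0 means "finished cleanly"; -1 means "cancelled by the sender".
HEADER = struct.Struct(">q")

def _read_exact(pipe, n):
    buf = b""
    while len(buf) < n:
        part = pipe.read(n - len(buf))
        if not part:
            raise EOFError("pipe closed mid-block")
        buf += part
    return buf

def send_blocks(pipe, chunks):
    for chunk in chunks:
        pipe.write(HEADER.pack(len(chunk)))
        pipe.write(chunk)
    pipe.write(HEADER.pack(0))  # signal completion

def recv_blocks(pipe):
    while True:
        (size,) = HEADER.unpack(_read_exact(pipe, HEADER.size))
        if size == 0:
            return  # sender finished cleanly
        if size < 0:
            raise RuntimeError("sender cancelled the transfer")
        yield _read_exact(pipe, size)
```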

Most of this should go into a Home Assistant issue. I'll file an issue over there and cross-reference.

It might be a while before I implement all of this in the Supervisor and get it through review (I'm not sure what the process is like).

sabeechen commented 3 years ago

I should say upfront that I'm not sure whether doing this would be worth the work, but it might be. It would be worthwhile to run your plans by the Supervisor people on their Discord. I've made a handful of contributions to the HA and Supervisor codebases, and while the maintainers are largely pleasant and helpful, they are overburdened and it's difficult to get their attention on anything that isn't a critical bug or the new feature they're already working on.

Some issues that might be a problem in the supervisor:

The way the addon works right now is that it streams the snapshot's file contents from an HTTP endpoint the supervisor provides directly to Google Drive. Presumably, if the supervisor provided an endpoint that streams from memory instead of disk, it would be almost trivial for the addon to support it. It would be a headache for anyone with a flaky internet connection, because you couldn't resume partial uploads that fail part way through (you can't step back in the stream), but it could be a configurable option for those who want to maximize their available disk space and are willing to sacrifice reliability.
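Roughly, "stream straight through" looks like the sketch below. The URLs, headers, and names are placeholders rather than the addon's actual code; the point is that the supervisor's response stream is forwarded chunk by chunk as the upload body, so nothing touches local disk, but a failed upload can't be resumed because the stream can't be rewound.

```python
import aiohttp

async def relay(snapshot_url: str, upload_url: str, token: str):
    async with aiohttp.ClientSession() as session:
        async with session.get(snapshot_url,
                               headers={"Authorization": f"Bearer {token}"}) as src:
            src.raise_for_status()
            # aiohttp forwards the response's StreamReader chunk by chunk;
            # nothing is buffered to disk, but a failure can't be retried from
            # the middle because the stream can't be re-read.
            async with session.put(upload_url, data=src.content) as dst:
                dst.raise_for_status()
```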

sabeechen commented 3 years ago

It's also worth noting that there are many other features I'd want to add first to how the supervisor handles snapshot creation, which would benefit not only users of this addon but anyone creating snapshots anywhere:

srdjanrosic commented 3 years ago

After a few failed attempts and some development environment setbacks, I have an approach that I think works, at least code-wise, for creating tar snapshots in a streamable way.

I still have to finish all the mechanical refactoring throughout the rest of the codebase and clean up a little.

I'll also need to figure out how to build and deploy it in QEMU or Hyper-V and test by hand - the built-in unit tests cover things really poorly.

The basic manual testing plan is to take one of my snapshots, restore it into a VM using the new code, produce a new snapshot from there, and then compare its tar contents with the original by looking at the internal tar data structures.
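Comparing the two archives could be as simple as diffing member metadata with tarfile; just an illustration of the idea, with names of my own choosing:

```python
import tarfile

def members(path):
    # Collect (name, size, type) for every member; ignore timestamps and order.
    with tarfile.open(path, "r:*") as tar:
        return {(m.name, m.size, m.type) for m in tar.getmembers()}

def compare(original, regenerated):
    a, b = members(original), members(regenerated)
    return sorted(a - b), sorted(b - a)  # (missing from new, unexpected in new)
```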

I think having this implemented will effectively work around the temporary storage issues, where leftover temporary "junk" never gets cleaned up after a crash. I've just been removing access to these temporary directory path variables as I go, and eventually we can simply not create them.

I thought about cancellation and/or progress reporting as well, even when storing a snapshot locally. We should be able to abort() the FileWriter and have failing writes propagate as exceptions up the stack, triggering the various cleanups and closes. Plumbing additional status reporting and free-space checks into a FileWriter (which is an io.BufferedWriter with some extra functionality) would also work. We could perhaps even make a tee-ing file writer that simultaneously stores a snapshot locally and relays it to the addon, but I'd probably leave that for a later iteration.
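The tee-ing idea could look something like this; a hypothetical sketch, not the supervisor's FileWriter:

```python
import io

class TeeWriter(io.RawIOBase):
    """Everything written goes to both targets, e.g. a local snapshot file and
    the pipe toward the addon. Hypothetical sketch, not the supervisor's code."""

    def __init__(self, primary, secondary):
        self._primary = primary
        self._secondary = secondary

    def writable(self):
        return True

    def write(self, data):
        self._primary.write(data)
        self._secondary.write(data)
        return len(data)

# Usage: wrap it the same way FileWriter wraps its raw stream, e.g.
# writer = io.BufferedWriter(TeeWriter(local_file, pipe_to_addon))
```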

On the subject of async vs. threads: it's making things slightly more difficult, but not insurmountably so. The existing code is also imperfect in that regard - for example, there's already some small filesystem access from async functions here and there that could block and, in my opinion, shouldn't run inside an async task/function. I'm not going to obsess over fixing that.
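(For reference, the usual way to keep that kind of blocking filesystem work off the event loop is to push it onto an executor; a generic example, not a quote from the codebase:)

```python
import asyncio
from pathlib import Path

async def file_size(path: str) -> int:
    # stat() can block on slow media, so run it on the default thread pool
    # instead of directly inside the async function.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: Path(path).stat().st_size)
```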

On the subject of uploads, Google Drive has several APIs, and one of them supports resuming uploads. I haven't looked at this addon's code, so I'm not sure which one is in use.
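For reference, the resumable flow in the Drive v3 API looks roughly like the sketch below (whether this addon uses it is exactly what I haven't checked): start a session with uploadType=resumable, then PUT the file in chunks with Content-Range headers against the session URI.

```python
import json
import requests

def start_session(token: str, name: str, size: int) -> str:
    # Start a resumable upload session; the returned Location header is the
    # session URI that subsequent chunk PUTs go to.
    resp = requests.post(
        "https://www.googleapis.com/upload/drive/v3/files?uploadType=resumable",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=UTF-8",
            "X-Upload-Content-Length": str(size),
        },
        data=json.dumps({"name": name}),
    )
    resp.raise_for_status()
    return resp.headers["Location"]

def upload_chunk(session_uri: str, chunk: bytes, offset: int, total: int):
    # A 308 response means "incomplete, keep sending"; 200/201 means done.
    end = offset + len(chunk) - 1
    return requests.put(
        session_uri,
        headers={"Content-Range": f"bytes {offset}-{end}/{total}"},
        data=chunk,
    )
```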

I'll keep at it, and hopefully I'll have something working and worth publishing soon. I expect around a 1000-line git diff in the end, and I already have someone I'd trust with this lined up to do the initial code review, although they're not from Nabu Casa, so I'm not sure how useful that would be.