sabeechen / hassio-google-drive-backup

Automatically create and sync Home Assistant backups into Google Drive

Home Assistant snapshots without temporary files #421

Open srdjanrosic opened 3 years ago

srdjanrosic commented 3 years ago

feature request

Problem:

We have large, relatively slow, and flaky microSD cards. Currently, the addon invokes Home Assistant snapshot creation on local media, triggering file reading and writing in order to create the compressed archive. On a microSD card this can take a while. The archive is then read again in order to upload it to Google Drive. With nightly backups, most of these daily snapshots get deleted without ever being used. In short: don't write the temporary file at all.

Proposal:

Have the supervisor write into a pipe. Have the addon read from a pipe and upload to drive.

This can be a named pipe on the filesystem, like the one you can create on the shell with mkfifo(1) in the same directory as snapshots... (I'm guessing that's the simplest option). We can communicate pipe (file?)names over the existing snapshot creation api.
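For illustration, a minimal sketch of the named-pipe idea (the path and function names are mine, not anything that exists today; the real pipe name would be agreed on via the snapshot API):

```python
import os
import shutil

# Hypothetical pipe path; the real name would be communicated via the snapshot API.
PIPE_PATH = "/backup/snapshot_stream.fifo"

# Supervisor side: create the FIFO and write the archive bytes into it.
def supervisor_write(archive_chunks):
    if not os.path.exists(PIPE_PATH):
        os.mkfifo(PIPE_PATH)
    with open(PIPE_PATH, "wb") as pipe:  # open() blocks until a reader attaches
        for chunk in archive_chunks:
            pipe.write(chunk)

# Addon side (separate process): read from the FIFO and feed it to the uploader.
def addon_read(upload_fileobj):
    with open(PIPE_PATH, "rb") as pipe:
        shutil.copyfileobj(pipe, upload_fileobj)
```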


I'd expect that not having to write the data to the SD card (only reading it instead) would make snapshot creation more than 2x quicker and would also extend the life of already not-so-long-lasting SD cards - this would be great all around.

... but it requires changes in both Home Assistant and this addon to actually make it work. I'm filing a feature request here, since the Home Assistant changes would be useless without code here that can use the feature.

... yay/nay/maybe? good idea? thoughts?

if yay, where would we go from here?


srdjanrosic commented 3 years ago

I actually spent some time trying to understand the code.

The way I read it, Home Assistant Core calls the Supervisor using HTTP over an IP socket, passing parameters after some lightweight validation, and the request ends up in this class, which handles archiving and unarchiving in a mostly self-contained way:

https://github.com/home-assistant/supervisor/blob/564e9811d0f594d95152f298514e2a1942dbad67/supervisor/snapshots/snapshot.py#L66

The archiving mechanism itself is very simplistic and really bad from an I/O-performance and wear-and-tear perspective. Creating a snapshot ends up writing the data twice (not once, as I originally thought): first it creates a temporary directory with all the addons and folders, then it archives that directory (a second write of the same data), and then this addon uploads it.

The final archiving is done using Python's built-in tarfile.TarFile in "w|gz" mode into a fileobj: https://github.com/home-assistant/supervisor/blob/564e9811d0f594d95152f298514e2a1942dbad67/supervisor/utils/tar.py#L82

...or, here, into an actual filename when the snapshot isn't encrypted: https://github.com/home-assistant/supervisor/blob/564e9811d0f594d95152f298514e2a1942dbad67/supervisor/utils/tar.py#L54
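To make the "w|gz" streaming mode concrete, here's a minimal standalone example (not the supervisor's code): in this mode tarfile writes the archive sequentially and never seeks in the target file object, so a pipe would work just as well as a regular file.

```python
import tarfile

def write_archive(fileobj, paths):
    # "w|gz" is tarfile's streaming gzip mode: members are written sequentially
    # and the target file object is never seeked, so a pipe or socket works too.
    with tarfile.open(fileobj=fileobj, mode="w|gz") as tar:
        for path in paths:
            tar.add(path)

# Example: stream the archive into a regular file (could equally be a FIFO).
with open("/tmp/example.tar.gz", "wb") as f:
    write_archive(f, ["/etc/hostname"])
```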

I guess it's kind of ok.

It's looking like:

  1. Do some light cleanup and refactoring.
  2. Plumb a TarFile-compatible fileobj through SecureTarFile and up the stack.
  3. Figure out how to stream/write the tar file contents. Since tar requires each archive member's size to be known in advance, we might need to split large member files that don't fit into RAM/buffers - something we can only determine after the first buffer rolls over.
  4. Figure out how to correctly send the data to this addon. It occurred to me while making coffee earlier that plain filesystem FIFO pipes have no concept of communicating done-ness, so the reader could hang forever. To mitigate that, we could use a trivial protocol that writes the number of bytes, then the bytes themselves, and finishes with a zero at the end (see the sketch after this list). It doesn't have to be super smart; if we need a smarter protocol later, we can request it when asking for a snapshot. As for cancellation, the size of the next block could be sent as -1.
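A rough sketch of what such a framing protocol could look like (the 8-byte signed length prefix and all the names are just my placeholders):

```python
import struct

# Each block: an 8-byte signed big-endian length, then that many bytes.
# A length of 0 means "finished cleanly"; -1 means "cancelled by the sender".
HEADER = struct.Struct(">q")

def _read_exact(pipe, n):
    buf = b""
    while len(buf) < n:
        part = pipe.read(n - len(buf))
        if not part:
            raise EOFError("pipe closed mid-block")
        buf += part
    return buf

def send_blocks(pipe, chunks):
    for chunk in chunks:
        pipe.write(HEADER.pack(len(chunk)))
        pipe.write(chunk)
    pipe.write(HEADER.pack(0))  # signal completion

def recv_blocks(pipe):
    while True:
        (size,) = HEADER.unpack(_read_exact(pipe, HEADER.size))
        if size == 0:
            return  # sender finished cleanly
        if size < 0:
            raise RuntimeError("sender cancelled the transfer")
        yield _read_exact(pipe, size)
```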

Most of this should go into a Home Assistant issue. I'll file an issue over there and cross-reference.

It might be a while before I implement all of this in the Supervisor and get it through review (I'm not sure what the process is like).

sabeechen commented 3 years ago

I should say upfront that I'm not sure whether doing this would be worth the work, but it might be. It would be worthwhile to run your plans by the Supervisor people on their Discord. I've made a handful of contributions to the HA and Supervisor codebases, and while the maintainers are largely pleasant and helpful, they are overburdened and it's difficult to get their attention on anything that isn't a critical bug or the new feature they're already working on.

Some issues that might be a problem in the supervisor:

The way the addon works right now is that it streams the snapshot's file contents from an HTTP endpoint the supervisor provides directly to Google Drive. Presumably, if the supervisor provided an endpoint that streams from memory instead of disk, it would be almost trivial for the addon to support it. It would be a headache for anyone with a flaky internet connection, because you couldn't resume partial uploads that fail part way through (you can't step back in the stream), but it could be a configurable option for those who want to maximize their available disk space and are willing to sacrifice reliability.
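Roughly, "stream straight through" looks like the sketch below. The URLs, headers, and names are placeholders rather than the addon's actual code; the point is that the supervisor's response stream is forwarded chunk by chunk as the upload body, so nothing touches local disk, but a failed upload can't be resumed because the stream can't be rewound.

```python
import aiohttp

async def relay(snapshot_url: str, upload_url: str, token: str):
    async with aiohttp.ClientSession() as session:
        async with session.get(snapshot_url,
                               headers={"Authorization": f"Bearer {token}"}) as src:
            src.raise_for_status()
            # aiohttp forwards the response's StreamReader chunk by chunk;
            # nothing is buffered to disk, but a failure can't be retried from
            # the middle because the stream can't be re-read.
            async with session.put(upload_url, data=src.content) as dst:
                dst.raise_for_status()
```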

sabeechen commented 3 years ago

It's also worth noting that there are many other features I'd want to add first to how the supervisor handles snapshot creation, which would benefit not only users of this addon but anyone creating snapshots anywhere:

srdjanrosic commented 3 years ago

After a few failed attempts and some development environment setbacks, I have an approach that I think works, at least code-wise, for creating tar snapshots in a streamable way.

I still have to finish all the mechanical refactoring throughout the rest of the codebase and clean up a little.

I'll also need to figure out how to build and deploy it in QEMU or Hyper-V and test by hand - the built-in unit tests cover things really poorly.

The basic manual testing plan is to take one of my snapshots, restore it into a VM using the new code, produce a new snapshot from there, and then compare its tar contents with the original by looking at the internal tar data structures.
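Comparing the two archives could be as simple as diffing member metadata with tarfile; just an illustration of the idea, with names of my own choosing:

```python
import tarfile

def members(path):
    # Collect (name, size, type) for every member; ignore timestamps and order.
    with tarfile.open(path, "r:*") as tar:
        return {(m.name, m.size, m.type) for m in tar.getmembers()}

def compare(original, regenerated):
    a, b = members(original), members(regenerated)
    return sorted(a - b), sorted(b - a)  # (missing from new, unexpected in new)
```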

I think having this implemented will effectively work around the temporary storage issues, where leftover temporary "junk" never gets cleaned up after a crash. I've just been removing access to these temporary directory path variables as I go, and eventually we can simply not create them.

I thought about cancellation and/or progress reporting as well, even when storing a snapshot locally. We should be able to abort() the FileWriter and have failing writes propagate as exceptions up the stack, triggering the various cleanups and closes. Plumbing additional status reporting and free-space checks into a FileWriter (which is an io.BufferedWriter with some extra functionality) would also work. We could perhaps even make a tee-ing file writer that simultaneously stores a snapshot locally and relays it to the addon, but I'd probably leave that for a later iteration.
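The tee-ing idea could look something like this; a hypothetical sketch, not the supervisor's FileWriter:

```python
import io

class TeeWriter(io.RawIOBase):
    """Everything written goes to both targets, e.g. a local snapshot file and
    the pipe toward the addon. Hypothetical sketch, not the supervisor's code."""

    def __init__(self, primary, secondary):
        self._primary = primary
        self._secondary = secondary

    def writable(self):
        return True

    def write(self, data):
        self._primary.write(data)
        self._secondary.write(data)
        return len(data)

# Usage: wrap it the same way FileWriter wraps its raw stream, e.g.
# writer = io.BufferedWriter(TeeWriter(local_file, pipe_to_addon))
```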

On the subject of async vs. threads: it's making things slightly more difficult, but not insurmountably so. The existing code is also imperfect in that regard - for example, there's already some small filesystem access from async functions here and there that could block and, in my opinion, shouldn't run inside an async task/function. I'm not going to obsess over fixing that.
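(For reference, the usual way to keep that kind of blocking filesystem work off the event loop is to push it onto an executor; a generic example, not a quote from the codebase:)

```python
import asyncio
from pathlib import Path

async def file_size(path: str) -> int:
    # stat() can block on slow media, so run it on the default thread pool
    # instead of directly inside the async function.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: Path(path).stat().st_size)
```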

On the subject of uploads, Google Drive has several APIs, and one of them supports resuming uploads. I haven't looked at this addon's code, so I'm not sure which one is in use.
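For reference, the resumable flow in the Drive v3 API looks roughly like the sketch below (whether this addon uses it is exactly what I haven't checked): start a session with uploadType=resumable, then PUT the file in chunks with Content-Range headers against the session URI.

```python
import json
import requests

def start_session(token: str, name: str, size: int) -> str:
    # Start a resumable upload session; the returned Location header is the
    # session URI that subsequent chunk PUTs go to.
    resp = requests.post(
        "https://www.googleapis.com/upload/drive/v3/files?uploadType=resumable",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=UTF-8",
            "X-Upload-Content-Length": str(size),
        },
        data=json.dumps({"name": name}),
    )
    resp.raise_for_status()
    return resp.headers["Location"]

def upload_chunk(session_uri: str, chunk: bytes, offset: int, total: int):
    # A 308 response means "incomplete, keep sending"; 200/201 means done.
    end = offset + len(chunk) - 1
    return requests.put(
        session_uri,
        headers={"Content-Range": f"bytes {offset}-{end}/{total}"},
        data=chunk,
    )
```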

I'll keep at it, and hopefully I'll have something working and worth publishing soon. I expect around a 1000-line git diff in the end, and I already have someone I'd trust with this lined up to do the initial code review, although they're not from Nabu Casa, so I'm not sure how useful that would be.