home-assistant / supervisor

:house_with_garden: Home Assistant Supervisor
https://home-assistant.io/hassio/
Apache License 2.0

Network Storage: Backups - UX overhaul to address multiple issues #4866

Closed jmealo closed 7 months ago

jmealo commented 9 months ago

The problem

There are a number of related issues with network storage backups that can cause data loss and inflict great pain on users. This issue will require multiple fixes and improvements to fully address.

Technical issues:

UX issues:

Data loss issues

Opportunities for improvement

GitHub issues:

What version of Home Assistant Core has the issue?

All

What was the last working version of Home Assistant Core?

Never

What type of installation are you running?

Home Assistant OS

Integration causing the issue

No response

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

haywiremk commented 9 months ago

https://github.com/home-assistant/supervisor/issues/4358#issuecomment-1818009197

Since this issue essentially collects many of the open issues around the root problem, which has been around for months with no sign of a root-cause fix being prioritized, the link above describes how to delete the local files over SSH to empty the directory so it can mount again. Hopefully this saves others from additional searching. Kudos to the original poster, as I always wondered how to get such access on the colored HA boxes remotely.
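Before deleting anything, it can help to see what is actually sitting in the directory. Below is a minimal, read-only sketch in Python. The function name `stray_entries` is made up for illustration, and the path you pass in depends on your installation. It only lists entries and never removes them.

```python
import os


def stray_entries(mount_point):
    """List entries left in a mount point directory while nothing is mounted on it.

    Hypothetical helper: it only reads the directory and deletes nothing.
    """
    if os.path.ismount(mount_point):
        # Something is mounted here, so the contents belong to the share.
        return []
    return sorted(os.listdir(mount_point))
```

An empty list means either that the directory is empty or that a share is currently mounted on it. Anything else is a local file that ended up in the directory while the share was not mounted.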

jmealo commented 9 months ago

Thanks for posting here. I don't know if there's a one-size-fits-all workaround. I think the exact steps could differ based on where and how you run HA.

I can provide workaround steps for KVM on Ubuntu.

I'm wondering whether usage of this feature is very low and effort isn't being allocated to it for that reason.

I offered to help on Discord if I could get someone in core to approve/collaborate on a solution. I didn't hear anything so I tagged everyone who touched this code. Not sure if that's considered poor etiquette in this community.

I'm very curious about how code ownership works in these repos, as I haven't found anyone to collaborate with on a fix despite my best efforts to do so.

IvovanWilligen commented 9 months ago

I have the same issue as mentioned in https://github.com/home-assistant/supervisor/issues/4358

I run HA on a Raspberry Pi 4 and unfortunately I get stuck trying to get the solution in comment https://github.com/home-assistant/supervisor/issues/4358#issuecomment-1818009197 to work.

I would be very grateful if the HA core team would take some time to get rid of the backup bugs.

jmealo commented 8 months ago

Here's a proposed fix that I believe will address these issues: https://github.com/home-assistant/supervisor/issues/4856

jmealo commented 8 months ago

Architecture discussion on a proposed fix for the Network Storage issues: https://github.com/home-assistant/architecture/discussions/1033

agners commented 8 months ago

@jmealo first of all, thanks for collecting all these issues and taking these notes :pray:

Things are a bit sprinkled all over the place now :sweat_smile: The network mount is exclusively a Supervisor feature, so this discussion belongs in the Supervisor repository. As a first step, I've moved this issue, which used to be in the Core repository, over here into the Supervisor repository.

> Architecture discussion on a proposed fix for the Network Storage issues: https://github.com/home-assistant/architecture/discussions/1033

This would probably also belong here rather than in the architecture repository. E.g. the overall design of the mount feature was not discussed in the architecture repository. Most discussions were in #2564, and some discussions are not captured on GitHub as they happened on Discord or in other places.

I'd suggest using this issue tracker, specifically this very issue, to further discuss how we proceed with network storage.

agners commented 8 months ago

A bit of background: The network storage feature makes use of systemd mounts. Essentially, the Supervisor instructs systemd running on the operating system to create mounts using D-Bus. The mounts are not persisted on the OS side. This means on reboot the Supervisor instructs systemd to recreate those mounts.

Furthermore, we use a two-stage system: we mount a network storage internally to a common place, and then bind mount it to the actual place.
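To make the two stages concrete, here is a rough Python sketch that only builds the equivalent `mount` command lines. This is an illustration, not how the Supervisor does it: the Supervisor asks systemd to create mount units over D-Bus rather than calling `mount`. The directories `INTERNAL_ROOT` and `VISIBLE_ROOT`, the function name, and the share address in the usage note are all assumptions made up for the example.

```python
from pathlib import PurePosixPath

# Hypothetical directories; the real locations are not stated in this thread.
INTERNAL_ROOT = PurePosixPath("/mnt/internal/mounts")
VISIBLE_ROOT = PurePosixPath("/backup")


def two_stage_mount_commands(name, fstype, source):
    """Build (but do not run) the two commands of a two-stage mount."""
    internal = INTERNAL_ROOT / name
    visible = VISIBLE_ROOT / name
    return [
        # Stage 1: mount the network share at the internal, common place.
        ["mount", "-t", fstype, source, str(internal)],
        # Stage 2: bind mount the internal path to the place where it is used.
        ["mount", "--bind", str(internal), str(visible)],
    ]
```

For example, `two_stage_mount_commands("nas", "nfs", "nas.local:/export/ha")` returns one NFS mount command followed by one bind mount command whose source is the target of the first.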

It seems we have life cycle problems, especially around network storage systems appearing and disappearing. Technically, we should be able to get notified about a failing mount from systemd via D-Bus. However, some of these cases are not noticed even by the OS. For example, when a NAS disappears and nothing is being written to it, the system might just not notice until something actually gets written to it. The question then becomes: what happens in this case? I guess whoever writes at that point will get an error on their write system call. What I wonder is whether the systemd mount unit also fails :thinking: This needs a bit of investigation.
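As a starting point for that investigation, a probe along these lines could be run before writing. It is a minimal sketch using only the Python standard library. The name `probe_mount` and the default timeout are assumptions, and it does not talk to systemd at all. It runs `os.path.ismount()` in a daemon thread because a file system call on a dead network share may block for a long time.

```python
import os
import threading


def probe_mount(path, timeout=5.0):
    """Return 'mounted', 'not_mounted', or 'unresponsive' for a path."""
    result = []

    def check():
        # os.path.ismount() stats the path, which may block on a dead share.
        result.append(os.path.ismount(path))

    # Daemon thread: a probe stuck in the kernel must not keep the process alive.
    worker = threading.Thread(target=check, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return "unresponsive"
    return "mounted" if result and result[0] else "not_mounted"
```

This does not answer whether the systemd mount unit fails. It only shows what a writer could check on its own, and a share that still looks healthy at probe time can of course disappear right afterwards.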

Conceptually, I'd say the system should behave as follows:

a) Supervisor should notify the user about any failed network mount. This can be at startup, or at whatever point the failure happens. The repair should then reliably mount the storage again.

b) If a backup gets triggered with a target location which is supposed to be a network mount (but failed to mount), the user should be notified. It is debatable whether we should still create a backup then, just on the local storage :thinking: Having a backup is better than none. On the other hand, this obviously has a huge potential of filling the disk.
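To make the trade-off in b) easier to discuss, here is one possible policy written out as a small Python function. Everything in it is hypothetical: the names, the `allow_local_fallback` switch, and the 1 GiB reserve are assumptions for the sake of discussion, not existing Supervisor behavior.

```python
from enum import Enum


class BackupAction(Enum):
    WRITE_TO_MOUNT = "write_to_mount"
    FALL_BACK_TO_LOCAL = "fall_back_to_local"
    ABORT = "abort"


def decide_backup_action(mount_ok, allow_local_fallback, free_local_bytes,
                         estimated_backup_bytes, reserve_bytes=1024 ** 3):
    """Return (action, user_message) for a backup aimed at a network mount.

    Hypothetical policy: fall back to local storage only if the user allows
    it and the backup would still leave `reserve_bytes` free on the disk.
    """
    if mount_ok:
        return BackupAction.WRITE_TO_MOUNT, None
    message = "Network mount is unavailable"
    fits = free_local_bytes - estimated_backup_bytes >= reserve_bytes
    if allow_local_fallback and fits:
        return BackupAction.FALL_BACK_TO_LOCAL, message + "; backup written to local storage"
    return BackupAction.ABORT, message + "; backup not created"
```

In both failure branches the user gets a message, which matches the requirement that a failed mount must never go unnoticed. Whether the fallback should exist at all is exactly the open question above.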

For a), I think there is mainly one issue, which is https://github.com/home-assistant/supervisor/issues/4358.

For b), we probably should define what behavior we exactly want, and implement this accordingly.

I also think that some of the issues above are probably no longer valid. E.g. 2024.01.0 improved backup error handling, and 2023.12.0 included https://github.com/home-assistant/supervisor/pull/4733, which addressed a problem during unmounting of network storage.

It would be helpful if we could collect the issues which share the same underlying problem. It would also be nice to have step-by-step instructions for reproducing those underlying problems, so we can reproduce them and work on the fix.

jmealo commented 8 months ago

@agners Thanks for the thoughtful reply. We agree on the "non-empty directory mount issue" being the primary issue at play here.

Where would be the appropriate place to put automated tests that run against flaky NFS and CIFS storage? I could start looking into that so we can better understand the issues.

github-actions[bot] commented 7 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

IvovanWilligen commented 7 months ago

Why is this closed? It should be addressed properly! It's still THE problem why I don't have off-device backups at the moment.

agners commented 7 months ago

> Why is this closed?

Because it went stale; just read the comment from the github-actions bot.

> It should be addressed properly! It's still THE problem why I don't have off-device backups at the moment.

This issue is a collection of issues (which is probably a bit problematic in itself :sweat:). What exact problem are you still experiencing?