Closed: Gamester17 closed this issue 3 years ago.
From an architectural and technical point of view, High Availability for Home Assistant and/or Hass.io could be achieved in many different ways. Which way is best depends largely on whether you want to make Home Assistant itself High-Availability-aware/cluster-aware as an application or not. The alternative would be to implement a High-Availability cluster only at the operating-system or container level in the Hass.io OS, where the Home Assistant application is not High-Availability-aware/cluster-aware. Such an OS/container-level-only solution would, however, probably imply an active-passive cluster where the second node sits in standby, which means that a fail-over would still mean a short downtime.
I understand that one way such an OS/container-level-only solution could be achieved in Hass.io is by using Docker Swarm with Docker Machine, as described here: http://mmorejon.github.io/en/blog/docker-swarm-with-docker-machine-high-availability/
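As a rough illustration of what that container-level approach could look like (a sketch only, not an official Hass.io setup; the addresses, paths and the shared-storage assumption are mine):

```sh
# Initialise a swarm on node 1 and join node 2 as a second manager.
# Note: a two-manager swarm cannot keep quorum if one manager dies,
# which is exactly the "witness" problem discussed below.
docker swarm init --advertise-addr 192.168.1.10
docker swarm join-token manager        # run the printed join command on node 2

# Run a single Home Assistant replica; Swarm reschedules it on the surviving
# node if the active one fails. The config directory must live on storage
# both nodes can reach (NFS, GlusterFS, etc.).
docker service create \
  --name homeassistant \
  --replicas 1 \
  --mount type=bind,src=/mnt/shared/hass-config,dst=/config \
  --publish 8123:8123 \
  homeassistant/home-assistant:stable
```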
Regardless of the level and type of High-Availability cluster you choose, one of the trickier problems is that you will also need an external "cluster witness", and you would have to solve that at a relatively low cost for a home environment, preferably without forcing the need for a third Raspberry Pi. This "witness" acts as a quorum tie-breaker to decide which cluster node should be the master if and when the cluster nodes lose connection to each other; otherwise you might end up with a "split-brain" scenario in which both nodes think that they should have the role of master node. One easy way to solve this is with a witness file on a third device on your network; that witness file must then be stored on a file system which can be shared and written simultaneously from multiple nodes (such as an NFS share on a NAS). Another option could be to offer the Home Assistant Cloud subscription service on the internet as a cluster witness, but the downside is that users would then depend on the internet for the High-Availability cluster to work.
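A very rough sketch of such a file-share witness, just to make the idea concrete (hostnames, paths and timeouts are made up):

```sh
# A third device (e.g. a NAS) exports an NFS share that both nodes mount.
mount -t nfs nas.local:/export/witness /mnt/witness

# Each node periodically writes a heartbeat file (node2 uses its own file).
while true; do date +%s > /mnt/witness/node1.heartbeat; sleep 5; done &

# When a node loses contact with its peer, it only promotes itself to master
# if the peer's heartbeat on the witness share has also gone stale.
peer_age=$(( $(date +%s) - $(cat /mnt/witness/node2.heartbeat) ))
[ "$peer_age" -gt 30 ] && echo "peer heartbeat stale, safe to promote"
```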
By the way, I do not know how Z-Wave would work with several adapters/controllers on multiple nodes in a Home Assistant High-Availability cluster, but for Zigbee adapters/controllers in the above-suggested active-active cluster you should technically be able to set the Zigbee adapter in one of the nodes (the 'master node') to be the "Zigbee coordinator" and the adapter in the other node (the 'slave node') to be a "Zigbee router"; the second node would then just route/forward/relay all the Zigbee signals to the first node, like a range extender. These roles of "Zigbee coordinator" and "Zigbee router" would then have to change automatically in a cluster fail-over scenario, with the "Zigbee router" being promoted to "Zigbee coordinator" by the Home Assistant ZHA component.
If we promote a router to coordinator, do we need to re-pair all devices?
I think that one objective of a "High-Availability (HA) cluster of Home Assistant" should be to get back to operation quickly and automatically after a fail-over. If that is the agreed objective, then re-pairing all devices is not an option. HA fail-over from one node to the other (and back) should really be possible without any user intervention whatsoever.
Given that everything is already containerized, running these containers in a swarm/k3s/k8s HA cluster should be feasible with moderate effort.
Cloud connections would adapt seamlessly, as they are created by the running instance: if one drops, the next instance recreates the connection and that service keeps running.
What really needs a solution is how to keep wired devices connected. For USB devices there could be solutions, but they require client software. Moreover, the machine physically connected to the wired device would become a single point of failure, so if that one goes down the device is unreachable to the whole HA cluster. This kind of makes the whole HA point moot, in my opinion.
Everyone seems focused on wired devices. I have never used any wired devices exactly because they cause this problem, and I can't be the only one. There are alternatives to wired devices, like a zigbee2mqtt bridge. IMO there should be an easy High-Availability solution for Home Assistant that just states in bold text "This does not support wired devices". And even if I did use wired stuff, I would prefer to have a fallback for the core Home Assistant server even when anything wired is toast. Home Assistant is not only Zigbee/Z-Wave after all. Having anything working is better than nothing.
PS My idea is to have separate servers (RPi maybe) in different rooms, so if one is on fire, the other one can take over (and maybe raise a fire alarm :D ). This means that any attached USB devices will be on fire as well, so no USB switch or forwarding is going to help. Ok, maybe instead of a fire, it is just a blown fuse, but you get the idea.
Then nothing prevents you from running a Home Assistant instance in a swarm.
It should be noted that you may soon be able to back up a Zigbee coordinator and restore it to some other hardware, see these:
https://github.com/zigpy/zigpy/issues/557
https://github.com/Koenkk/zigbee-herdsman/pull/303
https://github.com/zigpy/open-coordinator-backup/
Also, note that backup and restore for some Z-Wave controllers and Zigbee coordinators of the same type is already possible, e.g.
https://aeotec.freshdesk.com/support/solutions/articles/6000108806-z-stick-gen5-backup-software
https://github.com/zha-ng/zigpy-znp
NVRAM Backup and restore
A complete NVRAM backup can be performed to migrate between different radios based on the same chip. Anything else will not work.
(venv) $ python -m zigpy_znp.tools.nvram_read /dev/serial/by-id/old_radio -o backup.json
(venv) $ python -m zigpy_znp.tools.nvram_write /dev/serial/by-id/new_radio -i backup.json
Tested migrations:
https://github.com/zigpy/bellows/pull/295
Theoretically this allows backing up and restoring NCP state between hardware versions. Please note this is highly experimental. The restore does not restore NCP children; it relies on the children simply finding a new parent.
To export the TC config, see `bellows backup --help`; usually it is just `bellows backup > your-backup-file.txt`.
The backup contains your network key, so you probably should keep that file secure.
To import, see `bellows restore --backup-file your-backup-file.txt`
Many older RF protocol standards before Zigbee and Z-Wave did not have anything stored or running on the USB device itself.
Personally, I would love to be able to easily set up two Zigbee coordinator bridges, such as the Sonoff ZBBridge WiFi-to-Zigbee bridge or these DIY Ethernet-to-Zigbee bridges, as a pair of redundant Zigbee coordinators: one of them is the primary that is actually used and regularly backed up, and in case of failure that backup is restored to the secondary Zigbee coordinator bridge, which acts as a warm/hot standby device (always on, just waiting for a Zigbee coordinator restore).
https://www.digiblur.com/2020/07/how-to-use-sonoff-zigbee-bridge-with.html
https://github.com/zigpy/zigpy/discussions/584
Again, please see the discussion about Zigbee coordinator backup and restore here:
https://github.com/zigpy/zigpy/issues/557
https://github.com/Koenkk/zigbee-herdsman/pull/303
https://github.com/zigpy/open-coordinator-backup/
PS: On these, ZHA requires a Tasmota or ESPHome serial server and the standard Silicon Labs EmberZNet (EZSP) Zigbee firmware for EFR32.
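To make the warm-standby idea above a bit more concrete, here is a sketch of the backup/restore flow using the `bellows` commands already mentioned (the `socket://` hostnames and the shared path are just examples, not a tested recipe):

```sh
# Periodically (e.g. from cron) back up the primary coordinator bridge.
bellows -d socket://zigbee-primary.local:8888 backup > /mnt/shared/zigbee-backup.txt

# If the primary bridge fails, restore the latest backup onto the standby
# bridge and point ZHA at socket://zigbee-standby.local:8888 instead.
bellows -d socket://zigbee-standby.local:8888 restore --backup-file /mnt/shared/zigbee-backup.txt
```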
Can the second HA instance join the ZigBee as a router? Will it still be able to bind to the end device clusters and access sensor data?
Can the second HA instance join the ZigBee as a router?
Not yet, but IIRC I read somewhere a comment that Adminiuga wrote to dmulcahey which hinted that it could be possible in the future.
Yes, I know it has not yet been developed. This is an architecture issue, to discuss what's possible.
Let me comment on this issue.
A few points on High Availability (i.e. an active-active setup). This is about the application availability, not the database. Internally generated events, such as time-based triggers, would need to be split between the nodes so that only one node fires each of them, for example by a node firing only when `time % N + K == 0` for its own K (if you forgive me the C notation); in other words, each node handles its own share of the ticks. If there are other internally generated events, they must be treated in a similar way.

I would be willing to start working on the HA support. It looks like a popular demand; apart from availability it would also improve scalability and overall performance. But I would need high-level approval from the core team on the approach. @balloob what do you think, is HA a good idea, if we manage to do it with the proper care, with documentation and a staged approach, and without breaking or overcomplicating things?
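(A toy shell illustration of that partitioning idea, reading the expression as "each node only handles its own residue class of ticks"; N, K and the minute granularity are arbitrary and not anything that exists in Home Assistant today:)

```sh
# Hypothetical: node K of an N-node cluster decides whether it should fire
# a periodic time-based trigger this minute.
N=2; K=0                                  # this node's index
minute=$(( $(date +%s) / 60 ))
if [ $(( minute % N )) -eq "$K" ]; then
  echo "this node fires the time trigger"
fi
```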
1. You are missing the amount of information stored in `.storage` that needs to be replicated and synchronised
Yes, you are right. BTW, why don't we keep all of it in the DB?
The database only contains history/stats. We don't keep any other data in the databases.
Baking replication/consistency/high availability into Home Assistant itself could be pretty complex, so the comments above about leaning towards a fast fail-over with active/standby using the containerization platform (outsourced entirely, such that only one instance is active at a time) would be much more feasible -- and would meet most users' requirements.
Maybe start by getting alignment on the external storage problem pointed out. That alone could be quite a challenge, yet it seems an order of magnitude simpler. Edit: that issue reads like a feature request rather than an architecture issue with a proposed plan for comment, so maybe that specific issue isn't the right one.
(In the case of my cluster, I'd run external storage via Kubernetes and figure out some kind of leader election, I guess similar to the Docker Swarm proposal above. I would imagine we wouldn't want Home Assistant to have an opinion on how users achieve this, though there could be some common recipes for folks to share.)
My 2 cents:
Maybe not all Home Assistant installation methods are (initially) feasible for high availability.
Supervised installations can more easily accept a new/revised Supervisor that could synchronize with other instances and move the master role in case of failure. The Supervisor already keeps the structure of containers and their versions stable based on a certain configuration; that configuration (of containers) could be shared among a cluster of Supervisors, which could then replicate the same configuration across many machines.
Then the next point is: what would be the requirements for any other Supervisor to start running Home Assistant? A DB globally available to each Supervisor, for sure.
But then what is the failure scenario? If the entire machine goes down, then it's safe to say another instance would probably be on a different machine and possibly on another network.
By locking the failure scenario (at least initially) to a "same network" case, a reworked Supervisor could do what is being discussed.
And a future step then becomes possible: decouple hardware interfacing from HA itself and build a separate container that can run anywhere (basically offload and decouple local network interfacing as well as serial-port interfacing from HA core); then even more disaster scenarios can be covered.
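(Purely as an illustration of that decoupling, not an existing Home Assistant feature: a USB Zigbee/Z-Wave stick can already be exposed over the network with a generic tool such as `socat` or `ser2net`, so whichever node is currently active can reach it. The device path and port below are made up; the box holding the stick of course remains a single point of failure.)

```sh
# On the small box that physically hosts the USB stick: expose the serial
# port as a raw TCP service.
socat TCP-LISTEN:20108,reuseaddr,fork /dev/ttyUSB0,b115200,raw,echo=0

# On whichever Home Assistant node is active, configure the integration with
# a network serial path instead of local hardware, e.g.:
#   socket://zigbee-gateway.local:20108
```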
The database only contains history/stats. We don't keep any other data in the databases.
Not to criticize, but genuinely interested: that's a statement regarding the current state of affairs, not an answer to the question asked, i.e. why?
So, I'm still always confused about this. You can still make a reasonable high-availability system with HA just as it is today.
Replicate the storage (e.g., DRBD, GlusterFS, ZFS replication, whatever floats your boat), add some more into the mix (keepalived, HAProxy, Corosync, Pacemaker, or even live migration of VMs), and add an external DB, like, dunno, MariaDB, and put it into a Galera cluster... Pick whatever you like in all these cases.
There are so many possibilities to solve this all outside of HA. I'm still completely confused as to why HA itself has to worry about a complex use case that will end up being used by just a few, especially considering the tons of tools already available that make this all possible.
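For illustration, one such recipe (hostnames, brick paths and the data directory are assumptions on my part): replicate the config directory with GlusterFS and move the recorder onto an external MariaDB/Galera cluster that every node can reach.

```sh
# Two-node replicated GlusterFS volume holding the Home Assistant config
gluster peer probe node2
gluster volume create hass-config replica 2 node1:/bricks/hass node2:/bricks/hass
gluster volume start hass-config
mount -t glusterfs node1:/hass-config /mnt/hass-config    # mount on both nodes

# Point the recorder at the external database instead of the local SQLite file
# (configuration.yaml on both nodes):
#   recorder:
#     db_url: mysql://hass:password@galera-vip.local/homeassistant?charset=utf8mb4
```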
Because a fail-over takes time: first to detect the disaster condition and then to start another node. But with the Active-Active setup you will already have a running node, so it will take no downtime at all. Can you imagine updates with no downtime and 100% availability of your smart lights? For starters, that's just cool!
@frenck those solutions sound like a nightmare of maintenance and cost; this is an example of "just because you can do something, should you?"
If a stateful application's architecture does not support being highly available, there will always be issues you will face, no matter how much you hack away at trying to solve it from outside the application code.
I run Home Assistant in Kubernetes and no way I will try to architect a way around for it to be HA. I am completely fine with it being a SPOF. But it goes without saying that if Home Assistant ever did implement this type of architecture I would be one of the first to jump on it. :)
But with the Active-Active setup you will already have a running node, so it will take no downtime at all.
Even if we did the best job in the world, that cannot be achieved, simply because a lot of devices/services will not handle that right. Besides, be honest here: a minute of downtime for a takeover is absolutely not a problem. Heck, make it 10 minutes.
@frenck those solutions sound like a nightmare of maintenance and cost
Oh no, everything listed is open source, so in terms of cost, that would be time. Doing High Availability is never easy; any solution will cost maintenance.
We are not dealing with an application with a database that simply spits out pages like a website or so. We are dealing with an application that relies on a lot of external sources (devices/services). I bet a lot of devices and services will simply not be able to handle High Availability cases in a way that matches any of the wishes from this thread; there is also an active running state engine with an automation engine on top (that relies on runtime/memory).
Building Home Assistant into a true High-Availability application would be a nightmare. Especially considering that all the tools for making a reasonable setup that could already do this are available, the question becomes:
"Is the juice worth the squeeze?"
In my personal opinion: Most definitely not.
We're not going to implement any form of high availability. The added complexity is not worth it.
Would it be possible to make Home Assistant into a High-Availability (HA) application with automatic fail-over as an installation option for redundancy?
That is, have Home Assistant run as a High-Availability (HA) application with multiple instances, not simply for forwarding/repeater or performance functionality, but as a High-Availability (HA) cluster with automatic fail-over functionality?
Home Assistant supports (or did support?) multiple instances synchronized using a master-slave model; however, as I understand it, this is not a true High-Availability (HA) setup, as the slave only has forwarding/repeater functionality and there is no automatic fail-over function where the slave has a full replica of the database and is automatically promoted to master in a failure scenario(?).
As requested/discussed in the forum here https://community.home-assistant.io/t/high-availability-ha/52785
As smart-home devices/appliances become more and more part of our everyday usage, we are starting to rely on the availability of our home automation controllers, especially if you are using Home Assistant on a computer with a Z-Wave and/or Zigbee dongle. It would therefore be very nice to have the option of achieving a higher degree of resilience with the help of multiple installations of Home Assistant on the same home network working as one, in a true High-Availability (HA) setup, to secure its uptime: either active-active with no master, or active-passive where the slave node can automatically be promoted to master in a fail-over scenario. I guess the most common failure scenario today is a corrupt or failed SD card when running Home Assistant on Raspberry Pi computers, but another common scenario is regression issues on upgrades, which are usually covered by not upgrading both High-Availability nodes at the same time; that would be possible as long as the High-Availability function is compatible across different versions of Home Assistant.
Home Assistant running as a High-Availability (HA) application would imply Home Assistant running, for example, on two Raspberry Pi computers (a.k.a. nodes) as one Home Assistant instance (a.k.a. an HA cluster). In an active-passive setup, both Raspberry Pi computers would be running Home Assistant, but one node would be in a kind of standby mode, just waiting for the other to stop responding before taking over (fail-over). If you have no physical Z-Wave or Zigbee USB/serial adapters connected, the second node could take over all functions directly; if you do have physical Z-Wave or Zigbee USB/serial adapters connected, it would notify and prompt you to move your adapters to that second node. In an active-active setup, both Raspberry Pi computers would have the exact same hardware installation, meaning both Raspberry Pi computers/nodes would each have to have the same type of physical Z-Wave or Zigbee USB/serial adapters connected.
Another very nice feature that many modern High-Availability clustered applications (especially network appliances) have today, and which the Hass.io OS could gain with this, is something referred to as "Cluster Operating System Rolling Upgrade": automatic rolling updates of the full application and/or the whole operating system, one cluster node at a time. This normally allows continuous delivery of all functions without any downtime, as each node in the cluster fully takes on all functionality alone while the other node performs its upgrade, and then vice versa, in an automatic fail-over and fail-back procedure. Such rolling upgrades in a High-Availability cluster therefore do not interrupt services to the users.
If you are using Home Assistant to, for example, control all your heating and/or lights, then having it run in a High-Availability (HA) setup for higher reliability could certainly also help it reach a higher WAF (Wife Acceptance Factor). In my experience, tinkering too much with heating or lights on your one and only home automation controller is normally not acceptable when you have a wife/partner and young kids at home, as your home automation controller is then in "production" and has to be working all the time.