Multiple MQTT brokers - Githubissues

vladbabii commented 8 years ago

Short version: any tips on the best way to add support for multiple brokers?

First, thank you for your work! I run a small home automation network based on three mqtt brokers that are linked together and a bit of home-brewed "glue" code that runs 3 times in parallel on all 3 servers.

For the MQTT connection in what I'm using now, I set three broker IPs and the devices try each one in sequence until a connection succeeds.
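A minimal sketch of that "try each broker in sequence" behaviour. This is illustrative Python, not code from the poster's nodes or from Homie; the function name, the `try_connect` callback and the addresses are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def connect_first_available(
    brokers: Sequence[str],
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Try each broker in list order; return the first that accepts, else None."""
    for address in brokers:
        if try_connect(address):
            return address
    return None


# Hypothetical usage: only the third broker is reachable.
reachable = {"192.168.1.102"}
chosen = connect_first_available(
    ["192.168.1.100", "192.168.1.101", "192.168.1.102"],
    lambda address: address in reachable,
)
```

Passing the connect step in as a callback keeps the ordering logic separate from whatever MQTT client actually opens the socket.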

With Homie, I can only set one server, and if that one fails I have no option for failover.

Where would be the best place to modify the Homie code to support multiple broker IPs?

Or could it be done in Homie.onEvent, by adding an MQTT connection failure event and changing the broker IP on the fly after each failure?

Thanks

jpmens commented 8 years ago

I fail to come up with a reason why I would run more than one broker (except for testing purposes of course), and if I did I would bridge them together in order to be able to have one endpoint to which my clients connect and have the brokers distribute the messages via the bridge to other brokers.

Do you really need to connect to one of three different brokers?

vladbabii commented 8 years ago

For a home automation system the most important part is the reliability and stability of the platform.

For example: my system can work 12 hours without power (the main security parts work up to 48 hours) and has batteries on almost all nodes (except a few that send data I don't care about right now). I can pull the plug / power off / reboot / crash 2/3 of the infrastructure and you would not notice more than a minor delay (up to 1 second) when parts of the system go down.

For wifi, I have multiple APs that cover each other, so if half of them fail you can still have the esp8266 / arduino devices working. The normal APs have very good range/signal but only work for 8 hours on battery; the failover APs can run up to 72 hours on batteries and start within 10 seconds after the main APs stop responding.

For mqtt, I have multiple brokers. If a broker goes down, the esp / arduino clients connect to another one.

The other way of providing failover / high availability for MQTT is to have two or three load balancers in front of two or three MQTT servers, sharing an IP between them (when one host goes down, the public-facing IP is moved to another server).

Setting up and maintaining a floating IP is much harder than modifying the Homie code for my use case: to use one of (mqtt ip, mqtt ip+1, mqtt ip+2) in case of failure to connect.

The three brokers are clustered, so in the end it does not matter which one the client connects to, but if I have to reboot / shut down one of the servers, all the clients will migrate to another server.

On boot, the list of 3 IPs is randomized so that not all nodes fail over to the same server, and the values 0, 1 and 2 are put into a small array that indicates which broker IP to use. The index of the current MQTT server is also set to 0 (to try the first entry in the list).

On failure to connect:

do index++
if index = 3 (the last server has already been tried), the list is randomized again and index is set back to 0

Of course I could change 3 to any number, so I could support any number of servers.
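The rotation described above could be sketched roughly like this. It is an illustrative Python model, not the firmware code: the class and method names are hypothetical, and the wrap-around is read as "once every entry has been tried, reshuffle and start again from the first entry", which is an interpretation of the steps rather than something the post spells out.

```python
import random
from typing import List, Optional, Sequence


class BrokerRotation:
    """Shuffled broker list with an index that advances on each failure."""

    def __init__(self, brokers: Sequence[str], rng: Optional[random.Random] = None):
        self._rng = rng if rng is not None else random.Random()
        self._order: List[str] = list(brokers)
        self._rng.shuffle(self._order)  # on boot: randomize so nodes spread out
        self._index = 0                 # start with the first entry

    def current(self) -> str:
        return self._order[self._index]

    def on_connect_failure(self) -> str:
        """Advance to the next broker; reshuffle once the whole list was tried."""
        self._index += 1
        if self._index >= len(self._order):
            self._rng.shuffle(self._order)
            self._index = 0
        return self.current()
```

Because the list length is taken from the input, the same sketch covers any number of brokers, not just three.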

Just to be clear, I'm not asking you to modify Homie. I'm asking where you would modify it in the least intrusive way, so that my changes have a better chance of still working with future versions with as little hassle as possible.

Of course, if you have different ideas on building a high-availability automation solution I would be very interested in hearing them (there are so many ways of doing things that I'm always learning something new)

Thank you for the quick reply!

jayaras commented 8 years ago

Could you solve this currently w/o a code change and just use multiple A records and a very low TTL? Not sure if all the moving parts would play well with that but if it does a lookup on each connect it could get you what you want and possibly even more flexible... If it does not do a resolve on each connect then you could reboot the device on disconnect.
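A sketch of the "resolve on every connect" idea, in Python for illustration only. It assumes the broker hostname has several A records; the function name and the `resolver` parameter are hypothetical, and whether the ESP8266 firmware actually performs a fresh lookup on each connect is exactly the open question raised above.

```python
import socket
from typing import Callable, List


def resolve_broker_addresses(
    host: str,
    port: int = 1883,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return every distinct address the name currently resolves to, in order."""
    addresses: List[str] = []
    for _family, _type, _proto, _canonname, sockaddr in resolver(
        host, port, type=socket.SOCK_STREAM
    ):
        ip = sockaddr[0]
        if ip not in addresses:  # the resolver may repeat an address
            addresses.append(ip)
    return addresses
```

Calling this before each connection attempt, and then trying the returned addresses in order, gives failover without any change to the broker list stored on the device. The `resolver` parameter exists only so the function can be exercised without a real DNS server.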

Having the option for multiple brokers seems like a good idea, even if the scope is just DIY and experimentation, and it would keep service dependencies down (I wouldn't want to run a DNS server if I didn't have to, which is why we have mDNS, or a layer 4 load-balancing/HA solution of some kind)...

vladbabii commented 8 years ago

Adding another service (like dns / mdns) means you have to replicate that too, and the more you have, the higher the chance some configurations get out of sync or something fails in weird and hard to debug ways.

Right now this is what I have on my nodes (my custom code).

0. start running
1. connect to wifi (up to 3 wifi networks can be defined; they are tried in order)
2. if all wifi connects fail and node.canselfreboot = true, wait 30 seconds then reboot, otherwise go to #0
3. if connected, but no dhcp ip was obtained, increment wifi configuration and go to #1
4. try to connect to mqttserver(0); if that fails, try (1), (2) and so on
5. if not connected to mqtt server and tried > count(mqtt servers list) * node.retrymaxcounter, go to #4
6. if not connected to mqttserver, increment wifi configuration and disconnect wifi (results in go to #1)
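One possible reading of this loop, counting the steps from 0 as the go-to references do, sketched in Python. The function and callback names are hypothetical. The retry budget of `len(brokers) * retry_max` attempts per wifi network before moving on is an assumption about how steps 5 and 6 interact, not something the list states unambiguously.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_connection(
    wifi_networks: Sequence[str],
    brokers: Sequence[str],
    join_wifi: Callable[[str], bool],      # True only if associated AND a DHCP lease was obtained
    connect_mqtt: Callable[[str], bool],
    retry_max: int = 2,
) -> Optional[Tuple[str, str]]:
    """Return (wifi network, broker) for the first working pair, or None."""
    for ssid in wifi_networks:             # steps 1 and 3: next network on failure
        if not join_wifi(ssid):
            continue
        budget = len(brokers) * retry_max  # assumed meaning of the step 5 limit
        for attempt in range(budget):      # step 4: brokers in order, wrapping around
            broker = brokers[attempt % len(brokers)]
            if connect_mqtt(broker):
                return ssid, broker
        # step 6: budget used up, drop this network and try the next one
    return None                            # step 2: caller waits and reboots, or starts over
```

The caller decides what to do with `None`: wait 30 seconds and reboot if the node may reboot itself, or simply run the function again.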

For the whole network to fail I need to have:

no ac power or 5/5 broken ups (1 server only has 1 ups)
3/3 broken managed switches
3/3 proxmox servers - broken / stolen / burned servers / ?? - if they crash, they have an embedded OS separate from the main one that allows remote console and i have some scripts there that reboot the host os if something locks up - so they would have to crash in a very weird way so they're half working; servers in different parts of the house, one is very well hidden
a broken 4th server that is in another country, has limited HA capability and connects to the home via two different internet connections (which together have had a cumulative downtime of 15 minutes over the last 5 years, of which 3 seconds were concurrent downtime that was probably me messing with stuff)
3/3 broken routers (each router shares 2 internet connections with the other 2 and then has one individual 3g/4g usb stick)
6 broken APs
3 broken containers with mqtt server that accept connections but do nothing with the data
a few dozens of scripts that monitor everything and restart / vote primary / alert me when something goes down

If one server with one UPS and any one of the 3 switches are working, I have a fully working system with partially working nodes.

This took about 2 years of on-and-off weekend work to set up. The server hardware is OK; the only hardware issues have been dead drives (which zfs handles very well, and each server has 2 spares for a 3/4-way mirror).

Nodes consist of: ip cameras, ip energy switches, zwave controllers (2 set as primary/secondary controller) as proxy to zwave devices, unify control center, pfsense routers, bluetooth scanners (for in-house location), wifi ir blasters.

The only thing I lack right now is a way to make the esp8266 sensor nodes better, so I'd like to learn the Homie framework, make some changes for myself and hopefully contribute useful code to the project.

Piling up extra services is not an option, because the more stuff you have the more it breaks in new and exciting ways. My goals for building this system are

simple to understand and as little 'magic' as possible
high availability / redundancy
quick response
no single point of failure
easy way to restore things in case everything breaks

Ultimately, using DNS to do failover may be a better option than modifying the library and would mean fewer headaches in the future. Load balancing would be done by replying with a different IP every X seconds, plus some backend scripts to synchronize the 3 DNS servers. But then, what do I do about the DNS server IPs? Do I put them in the configuration, or set them using DHCP? This is moving haproxy (or any other reverse proxy software) from in front of the broker to in front of the DNS servers...
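The "different IP every X seconds" part can be sketched in a few lines. This Python is illustrative only and the function name is hypothetical. Deriving the answer from the clock is an assumption made here to keep the example small; the post itself talks about backend scripts to synchronize the servers.

```python
from typing import Sequence


def dns_answer(ips: Sequence[str], now_seconds: float, rotate_every: float) -> str:
    """Pick the broker IP to hand out during the current time window."""
    slot = int(now_seconds // rotate_every)  # index of the current window
    return ips[slot % len(ips)]              # rotate through the list
```

If all DNS servers shared the same list and roughly the same clock, they would give the same answer without talking to each other, but this does nothing about the question of how clients find the DNS servers in the first place.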

You've given me some things to think about...

marvinroger commented 8 years ago

Well, to be honest, the current ESP8266 software and Arduino implementation, although way more stable and bulletproof than it was before, is not really production-ready, and I think it should not be used in critical environments like yours. Thinking about HA for homie-esp8266 is overkill, IMO. The problem is that adding multiple broker IPs would make for a bigger config.json and a bigger JSON object to parse (whereas now, the maximum size of the JSON object is known in advance, so it is allocated statically), and I am pretty sure this would benefit 0.01% of users. We cannot sacrifice the RAM of the other 99.99% to support this feature, and I am pretty sure you understand this. :)

vladbabii commented 8 years ago

I understand. I actually modified the code to try a set of IPs at offsets +0 to +3, so if your IP ends in 100 it tries all addresses from 100 to 103. That was a month ago. I looked again at the code here on github.com and it has changed a lot, so I'll probably have to do that again. I just set the initial IP and the max count of brokers. I also use keepalived to move the IP address between brokers (kinda overkill, but it is easy to set up).
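The address range from that modification can be illustrated like this. It is a Python sketch rather than the actual patch, which is not shown in this thread; the function name and the example address are hypothetical.

```python
import ipaddress
from typing import List


def broker_candidates(first_ip: str, max_offset: int = 3) -> List[str]:
    """List first_ip, first_ip+1, ... first_ip+max_offset as strings."""
    base = ipaddress.ip_address(first_ip)
    return [str(base + offset) for offset in range(max_offset + 1)]


print(broker_candidates("192.168.1.100"))
# ['192.168.1.100', '192.168.1.101', '192.168.1.102', '192.168.1.103']
```

Only the first address and the count need to be stored, which is why this adds so little to the configuration.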

If you think it would be useful to provide to others my modifications I could keep a fork updated or a patch in the future

Thank you for your time

marvinroger commented 8 years ago

I am sorry, and glad you understand.

Please do as you want, but maybe you had better wait for the release of v2. A lot of code might change once again before the release.

marvinroger commented 8 years ago

Thanks for wanting to contribute, by the way. 😉

rigorm commented 8 years ago

BTW, if you really want multiple MQTT servers, you could use load balancer software like haproxy combined with keepalived and a VIP set up in an active-passive mode. All free software.

vladbabii commented 8 years ago

@rigorm - yes, you can add haproxy or something similar in front, but it's a slippery slope: you start with 2 machines, and to protect yourself you add a third (the reverse proxy); then you need at least two of those, so now you have four. Or you put haproxy on each of the two, but then you still bet everything on a floating IP or on doing magic things with a switch (or two, to have redundancy). All of this can be avoided by allowing the client to have one IP set, plus a maximum range of IPs. So for two extra variables (one to keep the max range and one to keep track of the last connected increment), or just one if you do some fancy coding, you avoid everything above.
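The "two extra variables" idea might look like the following. This is a Python sketch for illustration, not code from any fork; the class and attribute names are hypothetical, and wrapping back to offset 0 after the last address is an assumption.

```python
import ipaddress


class RangeFailover:
    """One configured IP plus two small integers of failover state."""

    def __init__(self, first_ip: str, max_offset: int):
        self._base = ipaddress.ip_address(first_ip)
        self.max_offset = max_offset  # variable 1: how far past first_ip to go
        self.last_offset = 0          # variable 2: offset used most recently

    def current(self) -> str:
        return str(self._base + self.last_offset)

    def next_after_failure(self) -> str:
        """Step to the next address, wrapping back to first_ip at the end."""
        self.last_offset = (self.last_offset + 1) % (self.max_offset + 1)
        return self.current()
```

Since `last_offset` survives a successful connection, a node that later loses its broker resumes from the address it was last using instead of starting at the first one.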

Right now I have 4 nodes that all have equal standing and run:

distributed mqtt server (each node sends data to all peers) - an early version on github, still working on it heavily with tcp and socket connection
web services with apache / nginx, depending on which takes the least work to implement (will be consolidated in the future)
couchdb database (in sync with all others, with scripts that manage network split/join)
have equal standing to receive a few ips that are shared with keepalived
tincd connection to remote nodes that are used as remote connection relays
pm2 manages cluster services based on a custom plugin that works similarly to keepalived, allowing one service to have a number of defined instances and ensuring the data is kept in sync across all servers

Remote nodes each run

tincd for dynamic peer-to-peer vpn
haproxy as a front-end reverse proxy
apache/nginx and others for caching and running a couple of services locally

This is an adventure in home automation and one in discovering as many ways to do many-to-many data replication and failover. This is the core of a home automation system, running on very low-power devices.

The short version: I know how to and I do implement high availability services, but I always prefer less complexity instead of more.

Bee-Certain commented 7 years ago

Please don't take this too seriously. Sensors are cheap. For simplicity, consider a high reliability thermostat: Three independent temperature sensors going to three separate controllers with one MQTT broker each. Each controller has its own set point and issues a contact closure command to a separate Homie-driven relay. Wire up the three independent relay contacts to vote. In an extreme case, you could have three separate heaters, each with a separate power source.
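The relay voting in that thought experiment amounts to a 2-of-3 majority. A tiny Python illustration, with a hypothetical function name; the real thing would be wired contacts, not software.

```python
def two_of_three(a: bool, b: bool, c: bool) -> bool:
    """True when at least two of the three relay contacts are closed."""
    return (a and b) or (a and c) or (b and c)
```

With this arrangement, any single controller, broker or relay can fail in either state without changing the output.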

I know that's a lot of foolishness, but commodity hardware can achieve a lot. Now, about common mode software failures, ...

homieiot / homie-esp8266

Multiple MQTT brokers #107