homieiot / convention

🏡 The Homie Convention: a lightweight MQTT convention for the IoT
https://homieiot.github.io/
Other
715 stars 60 forks source link

$online - unexpected connection lost #24

Closed timpur closed 7 years ago

timpur commented 7 years ago

Wondering if its better to use the last will to post information about unexpected connection lost

Currently disconnecting correctly and unexpected are the same = false at $online

Wondering if its better to use a number, 1 = online, 0 = offline and -1 = unexpected connection lost ??

Tell me what you think. Tim.

marvinroger commented 7 years ago

Good point. What do you guys @kwave @ThomDietrich @euphi think about this?

I'd prefer an explicit text instead of code numbers. $online would not be a binary state anymore, so maybe we should change the attribute to $state or $status (don't know what's the most correct in english). The values could therefore be online, offline (clean) or lost (unexpected disconnection).

euphi commented 7 years ago

$state or $status is good (I'm unsure which one is better, both are ok).

For me, offline is too generic for a clean disconnect, so maybe disconnected is better. There may also be a state sleep for deepsleep. (However, this is a little bit problematic, because the state will stay in sleep if the device cannot reconnect for other reasons.

Also, you could send a state after MQTT connection:

Just after this, Homie shall send the online state. This allows to see a reset/reboot reason by subscribing to $state.

timpur commented 7 years ago

I like these ideas. Maybe have both? State is simply on or off ? Status is more complex, Connected , disconnected, lost, sleep, etc. ?? This way we have best of both worlds, do not need to change the standard, but also adding more information about the devices current status?

In dress to @euphi

reconnect if only MQTT connection and/or WiFi was lost reset if device was resetted poweron if there was a power failure wakeup after deep sleep (On ESP8266 this seems to be possible from API using the struct resetInfo but it seems this is somehow buggy?)

Not sure about these. These seem more like events? Maybe we could add events topic? but this is difficult because some of these events occur when the device is disconnected? Reason to split these up is, State and Status should always be set to the current state or status, so that when ever you subscribe to these topics you receive the current status or state of the device no mater when you subscribe to these topics. An Events topic would describe the the events that occur, but not necessarily the event its currently in (maybe make this not retained pub?). If this make sense.

Anyways thats my thoughts. Tell me what you think.

euphi commented 7 years ago

A simple $state and a more detailed $status are not possible, because the "unexpected loss" logic is based on the "last will". MQTT allows for only one last will.

euphi commented 7 years ago

The "reconnect" events are useful to see the reason of a lost connection.

The advantage of having a own state for this, is that it you can see also short connection losses (that does not trigger the last will) by subscribing to $online.

When you think about integration with home automaton systems, it is really useful to have one property to see reboots of a connected device.

timpur commented 7 years ago

Valid point about last will. It will have to be on one topic. I agree with your ideas. Just not sure if there should be combined? I think state and events should be kept separate. Reason for this is when an event occurs and publish the message, then do you also publish over with the current status. (publish reconnecting then connected ?). Also we could also use this events topic for custom publishing of events, and thus do we worry about publishing over these with the current status?

That's why i think they should be separate.

Thus Status and Events is my conclusion but its not up to me its up to the community.

gorootde commented 7 years ago

If the goal is only to differentiate between "clean" disconnects, and losses due to broken connections: I would use marvin's versions: online, offline, disconnected published on the existing $online topic. The additional state info suggested by euphi (reconnect,reset,poweron, wakeup) are very device specific. As the Homie convention is device independent, I think these should not be a part of it. (e.g. What if my device has no Wifi and has no possibility to detect why it was reset?)

euphi commented 7 years ago

I think it is very important that if a device reconnects it is made sure that it sends on the $online topic a value other than online before sending online again.

ThomDietrich commented 7 years ago

Just to say something: I'm undecided 😄 The task of the current $online topic with it's special property to be the LWT topic is to inform an interacting device/application if the device with it's furthermore published topics is available and in an operating and "interactable" state. If everyone agrees with this definition, the $online topic should transport a clear "yes, interactable" or "no, not interactable" message.

@timpur @euphi could you please specify in which constellation a communication partner would need a more specific info than "Device not available for interaction"?

I'm not trying to silence your ambitions. The idea of an extended LWT-linked topic or the addition of an additional one might be meaningful but we need to find a use case for that!

euphi commented 7 years ago

I have a rule in openhab that sends me a telegram message if a device is offline. This rule is only active for the deployed devices, so I also want to know if they restart, because this is a sure sign for a problem.

Currently the rule is based on detecting a negative step in uptime, which is somehow unreliable for restart of openhab or in case of overflow and does not work for a short loss of Wifi connection.

Especially a short loss of Wifi connection was not detected at all. This was quite problematic, because it happened to my irrigation controller and it seems that filling the water tanks to 100% and leaving a ladder (metal) close to the controller just before leaving for holidays worsened the Wifi connection, so sometimes it was lost for more than one day... I now learned that signal quality of 8% seems to be the minimum for connection, so I set up an alert rule in grafana (influxdb) that notifies me if a device as connection quality of < 12%.

(BTW: I have 5 devices running permanently, two of them for more than 1 year, the others for several months, with different versions of Homie-2.0.0 and there was no unexpected reset yet. One is running for more than a year, the other had some resets due to improvement of installation (wiring etc.), in one case the circuit breaker of the 230V supply tripped. In this case the alert rule was also helpful to detect this problem that also affected the fridge, dish washer and oven...)

marvinroger commented 7 years ago

That's a very valid use case. If we rename $online to $state, we keep a offline LWT message, and we send boot when the device is "booting" (and that's actually a state, not even an event), ready when the device is ready, and disconnected when it is cleanly disconnected. It would solve this, right?

timpur commented 7 years ago

Got my approval. Just not sure about the state names.

Boot (first thing it does when connected is send boot). Finishes booting and sends ready. Then disconnected when clean and lost when not clean disconnect?

Also don't think you have to get rid of $online could have both? Up to you.

euphi commented 7 years ago

Most important now is to come to a decision to use only one topic with fine granulated state or simple states and an extra topic for detailed information.

States should be generic, but we can use Homie-ESP8266 as example what is technically possible.

Thus said, states would be:

Personally I prefer to have all this states above on one $state topic. "connect" is not necessary then. "disconnected" may still be useful if there is another reason for disconnect than "update", "sleep" or "poweroff".

BTW: It is intended that the fully connected states start with "on" and the fully disconnected states start with "off". This helps to have simple pattern matching rules.

timpur commented 7 years ago

all i wanted was to know when the device was 'lost' :P, but having more clarity certainly helps and i think its a good idea. That said im not sure what is the right choice in all the options.

I think both @euphi and @marvinroger have the right ideas, so trust you guys to work this out (the fine details) :P.

bertmelis commented 7 years ago

I wouldn't over-engineer this and take 2 seperate topics. One mandatory LWT-topic with only 3 possibilities and one implementation-specific with as many possibilities as needed. So it is very clear if the device is available or not. If you want to know more: check out the (for example) $implementation/event topic.

As you can only have one testament, you cannot distinguish between powerloss and crash for example. So you'll lose information anyhow.

PS1 How would you send "booting"? When the device is booting I would think there's no MQTT-connection yet.

PS2 Keep in mind that MQTT allows for delivering messages ordered but it is not by default guaranteed. Although Homie will probably work with the different states that change quickly, it is in theory undefined.

ThomDietrich commented 7 years ago

@euphi I'm with @bertmelis and @marvinroger on this one. I'm not sure if we need all these disconnect substates. Also the "reconnect" state is not a state but an event. But without the intention to disregard all your states, could you name those you are interested in specifically and describe why the convention needs to unify them across clients and controllers?

@marvinroger @timpur trick question: If we define the LWT message "offline" and additionally publish a "disconnect" message in certain cases, is it okay to still receive the LWT message? I'd need to check but looking at the implementation that might become tricky. At least the python paho-mqtt package as I know it doesn't offer the option to reset or delete the LWT message while connected.

Edit: It seems to be possible, still the method could get tricky to implement and we should consider that while specifying $state. @marvinroger I trust you have more low-level experience with that.

marvinroger commented 7 years ago

@bertmelis I meant "booting the Homie announce" more than an hardware boot.

@ThomDietrich discarding the LWT is indeed only possible by cleanly disconnecting, and for MQTT, this means sending a DISCONNECT packet, waiting until TCP ACK, and then closing the connection. This is not tricky; all spec-compliant MQTT clients should support this. Even the low-power ESP8266 supports this with the async-mqtt-client client.

euphi commented 7 years ago

Minimum states that are necessary (Textual description):

The last one ist necessary to make auto-discovery reliable (e.g for openhab). Without this state, the controller cannot ensure that it has received all attributes/properties yet.

This could be especially problematic with bad network connection.

bertmelis commented 7 years ago

To make sure all your requirements are met, you also have to be sure the broker sends all the messages in order. With QOS1, this is only the case when the maximum number of in-flight messages is only 1, which is not by default on mosquitto.

euphi commented 7 years ago

Yes. However, the homie-implementation can wait for the messages to be acknowledged, before sending the next message. This can be achieved with a simple state machine.

The attributes and property announcement can be send at any order, so there would be only 3 phases necessary: "fresh connect" [send $state -> "connect"], "announcement" [send all device attributes and property announcement in arbitrary order], and "connected" [send $state -> "online"].

marvinroger commented 7 years ago

So, do we agree with that:

@euphi I think you forget one thing: what if the controller connects after the device announces itself? The ready $state might indeed be received by the controller before the other messages. So there are two solutions:

euphi commented 7 years ago

The value of the device statistics are not part of the initial announcement yet. So we need a mandatory $stats topic anyhow to auto-discover the $stats attributes. :-)

Then, only $implementation and the property attribute $unit is optional. For $implementation its up to the implementation to make it auto-discoverable. $unit then should be made mandatory with an empty string as allowed value.

marvinroger commented 7 years ago

I was thinking exactly the same thing about $unit. 😉

Let's do that!

bertmelis commented 7 years ago

I don't know if an empty string is going to work as everything is sent as retained. An empty payload means the topic will be deleted by the broker. Maybe you should use none. Or am I missing something?

Edit: my comment is not valid, since we're sending with QOS1 hence an empty message will be retained.

marvinroger commented 7 years ago

@bertmelis no, you're right... I thought empty strings were 1 byte long (null terminator), but I was wrong, the null terminator is not included in the payload. Therefore, the message would not be retained. Let's go for none.

euphi commented 7 years ago

none, NULL, null, void, or NIL? :-)

marvinroger commented 7 years ago

Done: 0003daace455b835bc64023615c3d10a5fa257d9

We only need to agree with https://github.com/marvinroger/homie/issues/24#issuecomment-321784570 and we should be good 😇

euphi commented 7 years ago

I could live with these 4 states only and their propose names (ready is better than online, init better than boot). Maybe offline could be renamed to lost?

However, I still think an extra state for deep sleep would be useful. The difference between disconnected and sleeping is, that disconnected is a planned shutdown (e.g. closing the Homie-Program), while sleeping is a state where the device is supposed to reconnect frequently.

Regarding an alert state: I'm used from my "domain" at work that device have some self-diagnosting abilities. The simplest variant to do this, is to tell the controller in a standardized way, that there is some major problem. Having this in the convention is much easier than to have an implementation specific solution.

marvinroger commented 7 years ago

Then:

How would you implement your alert state?

euphi commented 7 years ago

alert would have the meaning "homie is running fine and is connected, but the device is not able to perform its main task", e.g. Sensor is providing no data. So setting the alert-state is up to the "Application" that makes use of Homie.

The actual homie-implementation (e.g. homie-esp8266) should offer an API call to set the alert-state. Everything else, e.g. provide diagnosstic messages is implementation specific and out of scope of the homie-convention.

I'm aware that this is an extra feature, but I think a standard for this is quite useful for integration. It helps monitoring your devices.

For example you can easily shown them as "yellow" an a Homie "Dashboard": green OK, red lost, blue sleeping, white disconnected, orange: alert.

marvinroger commented 7 years ago

Alright:

Is everyone ok?

timpur commented 7 years ago

I like :)

ThomDietrich commented 7 years ago

Hey, I'm back from the weekend and like what I'm reading 👍

I also agree that the "alert" and "sleeping" states are meaningful and can be useful for certain cases.

Regarding the discovery process: I'd still strongly vote to follow the idea of

you can consider some safe defaults and as you receive the new attributes, you can improve your knowledge about the device

Despite a few fundamental topics, all other topics should be optional and hence provided with a default in the convention. A discovering controller (like the openHAB binding) has to make sure to catch all retained messages before further processing the devices/nodes/properties by applying a simple delay/timeout mechanism. That should be easy and other services work similarly. @kwave what do you think?

Back to topic: "I like :)"

timpur commented 7 years ago

@marvinroger Plans moving forward ? Have we concluded on the new structure? Implemented in beta 2 ?

euphi commented 7 years ago

A reference implementation is part of the implementation, so I created an issue at https://github.com/marvinroger/homie-esp8266/issues/390.

marvinroger commented 7 years ago

Indeed, let's move forward.

Can we guys read the whole thing at https://github.com/marvinroger/homie/tree/redesign and write your final comments/thoughts here?

timpur commented 7 years ago

New standard looks amazing and solves so many of my issues that i solves with just having my own standard like also adding a "units" attribute.

(TODO Clarify device behaviour in extra section for $online (or $state) topic.)

Not sure how to comment on this though ?

marvinroger commented 7 years ago

Closed by https://github.com/marvinroger/homie/pull/50