Closed kpine closed 3 years ago
Good idea. Suggestions on the topic and content?
For reference, Tasmota publishes these messages on a reboot:
tele/tasmota_D851A7/LWT Offline
tele/tasmota_D851A7/LWT Online
So perhaps:
On startup: rtl_433/$HOSTNAME_OR_ID/LWT Online
On termination/exit: rtl_433/$HOSTNAME_OR_ID/LWT Offline
I'm not sure if it's better to have a single topic with different payloads, or separate topics. But I would probably prefer the single topic with different payloads, something like `<topic prefix>/status` with payloads `offline` and `online`. That's an approach I've seen elsewhere [1] [2]. You could include other information besides the process status in a JSON payload, if desired.
I'm not using any sort of hostname or instance identifier in my topics, since I only have a single service running (maybe I should be). My MQTT config is:
output mqtt://hostname:1883,retain=0,events=rtl_433/events,states=rtl_433/states,devices=rtl_433/[protocol][/channel:0][/id:0]
So I'd want to see the status at `rtl_433/status` or so. Not sure if that's possible to derive automatically, or if it would need to be a separate config option (e.g. `,status=rtl_433/status`). Even better, but a different subject, would be to allow a custom topic prefix.
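To make the two alternatives above concrete, here is a minimal Python sketch of how a status topic could be picked from an output string like the one quoted. Both `status=` and `prefix=` are hypothetical options taken from this discussion, not existing rtl_433 options, and the helper names are made up for illustration.

```python
def parse_mqtt_output(spec):
    """Split an 'mqtt://host:port,key=value,...' string into (url, options)."""
    url, *pairs = spec.split(",")
    options = dict(pair.split("=", 1) for pair in pairs)
    return url, options


def status_topic(options, default_prefix="rtl_433"):
    """Pick the status topic: explicit status= first, then prefix=, then a default."""
    if "status" in options:  # hypothetical option proposed in this thread
        return options["status"]
    prefix = options.get("prefix", default_prefix)  # hypothetical option
    return prefix + "/status"


url, opts = parse_mqtt_output(
    "mqtt://hostname:1883,retain=0,events=rtl_433/events,"
    "states=rtl_433/states,devices=rtl_433/[protocol][/channel:0][/id:0],"
    "status=rtl_433/status"
)
print(status_topic(opts))
```

An explicit option wins over any derived value here, so an existing config without `status=` would simply fall back to the prefix or the default.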
Also, speaking of Tasmota, I came upon a GitHub issue related to this. We would probably want the offline status to be reported even in the case of a DISCONNECT.
We could introduce `prefix=` to clean things up. Not sure if many people use all of events, states and devices. The events and devices carry the same information in different formats. Perhaps `prefix/LWT` is better than `prefix/status`, as there is already `prefix/states` and that is confusing.
We have three different status messages available:
We would aim to always disconnect cleanly, thus the abnormal connection lost case could get another, more serious message.
What would parse the message and what are the expectations there?
> The events and devices carry the same information in different formats.
Thanks for the reminder, I've removed it, it was a holdover from some previous testing.
> What would parse the message and what are the expectations there?
In my case, I was planning to use this with Home Assistant MQTT sensors, which have an "availability topic" and configurable payload. If rtl_433 would update this availability topic to online or offline, the sensors will become available or unavailable respectively, so it would be trivial to know when things are in a bad state.
After a little more research though, I'm not sure that's the correct approach, or at least it's not a complete approach. It would tell me that rtl_433 is down, which is important. It wouldn't tell me anything about individual sensors, like if a battery died and the sensor stopped reporting. There are other ways to handle that which I need to add anyways. So tying this to individual sensors may not be the right idea.
Assuming I wanted to use the availability topic approach anyways, HA would expect a single topic (I think it supports wildcards but I don't know how that would work, never seen it used) and two payloads, representing offline or online. For example, I think Tasmota does this: topic `<prefix>/LWT`, message `Online` or `Offline`.
Adding a third message, like `Connection Lost`, would not work. HA requires one available string and another unavailable string. This seems to be the most common behavior that I've seen so far with other implementations. For my use case the reason for being offline is not very useful anyways. I imagine it might be for someone, but I just need to know if the service is running or not.
If I toss out the idea of using the HA availability topic, then the above doesn't matter much and the solution could be just about anything:
- `online` and `offline` topics, e.g. `<prefix>/available`. In HA I could subscribe to this topic and trigger on the payload. I could use this for sensor availability too.
- `Online`, `Offline`, `Connection Lost`. In HA I would just check the payload value for `Online` or not `Online`. I could not use this for sensor availability.
- `<prefix>/available` -> `{"state": "offline", "reason": "connection lost"}`. In HA I would just check `value_json['state'] == 'online'`. I could not use this for sensor availability.
- `<prefix>/available/state` -> `online` or `offline`, and `<prefix>/available/reason` -> `connected`, `exited`, `connection lost`. In HA I would just watch `<prefix>/available/state` and check for `online` or `offline`. I could use this for sensor availability too and just ignore the reason topic.

Also, even if rtl_433 is configured with the `retain` option off, you'd want to always retain these messages regardless so clients that connect after an exit would know.
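The JSON layout and the split state/reason layout from the list above could be sketched in Python as follows. This only builds the messages as `(topic, payload, retain)` tuples; it is not rtl_433 code, the function names are made up for illustration, and the retain flag is forced on as suggested.

```python
import json


def availability_messages(prefix, state, reason):
    """Split layout: one topic for the state, one for the reason, both retained."""
    return [
        (f"{prefix}/available/state", state, True),
        (f"{prefix}/available/reason", reason, True),
    ]


def availability_json(prefix, state, reason):
    """JSON layout: a single retained topic carrying both fields."""
    payload = json.dumps({"state": state, "reason": reason})
    return (f"{prefix}/available", payload, True)


def is_online(json_payload):
    """Consumer-side check, mirroring value_json['state'] == 'online'."""
    return json.loads(json_payload)["state"] == "online"


for message in availability_messages("rtl_433", "offline", "connection lost"):
    print(message)
print(availability_json("rtl_433", "offline", "connection lost"))
```

With the split layout, a consumer that only understands two payloads can watch the state topic alone and never see the reason.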
Thanks for the detailed analysis! It could be reasonable to assume that the user knows how to interpret the "Offline". Either as "that happens, I'm just testing" or "Critical, something important failed", which is to say users that never want to see "Connection lost" also never want to see "Offline" and would act on both, right?
Aside: I watch my batteries by adding all "battery_ok" fields and graphing that to quickly gauge the state of things. In fact I log all "battery_ok" fields to a TSDB and let Grafana add them to a percentage gauge ;)
What is connection lost in regards to? Is that to the USB device? If it's to the MQTT broker, surely that message can't be published?
Online/Offline LWT may be useful for me so I would know specifically which device was down, or to diagnose - pull the RTL usb and it would give me a MQTT Offline and I'd know that was device #X.
Re: low battery, another method is to use Node-RED and the existing MQTT messages. I have two patterns I use for all my rtl_433 received sensors: one triggers a notification/email if battery_ok is 0, and another fires if a message hasn't been received for X time (the sensor's MQTT topic feeds a trigger that is reset every time a new message is received, and if nothing is received after X hours the trigger sends a message to the email/notification nodes).
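Outside Node-RED, the same two patterns can be sketched in a few lines of Python. This is only an illustration with made-up sensor names and plain dictionaries standing in for the MQTT messages; `battery_ok` is the field mentioned above, everything else is an assumption.

```python
def low_batteries(readings):
    """Names of sensors whose latest reading has battery_ok == 0."""
    return sorted(name for name, r in readings.items() if r.get("battery_ok") == 0)


def stale_sensors(last_seen, now, max_age):
    """Names of sensors with no message for more than max_age seconds."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)


# Hypothetical latest readings and last-message timestamps (in seconds).
readings = {"garden": {"battery_ok": 1}, "attic": {"battery_ok": 0}}
last_seen = {"garden": 9_000, "attic": 100}

print(low_batteries(readings))
print(stale_sensors(last_seen, now=10_000, max_age=3_600))
```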
LWT is any connection loss without clean disconnect. This could be network problems or crash of any component. rtl_433 usually shuts down in a controlled manner, even on USB trouble. But this isn't currently flagged to the MQTT broker.
Actually, when I originally created this issue I didn't quite understand that LWT was technically for ungraceful disconnects only, not graceful ones. I thought it applied in both cases, mostly because other software I use acts this way. So that's what I phrased it as, and it's probably a bit narrow of a description.
The problem I was having (am still?) was the USB issue ("Async read stalled, exiting") and a bug in Docker. rtl_433 would exit, and the container would hang, even with a restart policy enabled. Generally, just restarting rtl_433 has worked for me, I can't remember having to physically remove the USB device.
So the request is really for a general status flag that would be set on startup and shutdown. Whether the shutdown is graceful or not doesn't really matter, it seems. I think including LWT just makes sense.
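The difference between a clean disconnect and a lost connection can be shown with a toy in-memory model. This is a sketch of the broker-side behaviour only (a will is discarded on a clean DISCONNECT and published otherwise), not a real MQTT client; `ToyBroker` and the `rtl_433/LWT` topic are illustrative assumptions.

```python
class ToyBroker:
    """In-memory stand-in for a broker; models only retained topics and wills."""

    def __init__(self):
        self.retained = {}  # topic -> last retained payload
        self.wills = {}     # client id -> (topic, payload)

    def connect(self, client_id, will_topic, will_payload):
        self.wills[client_id] = (will_topic, will_payload)

    def publish(self, topic, payload):
        self.retained[topic] = payload

    def disconnect(self, client_id):
        # Clean DISCONNECT: the will is discarded without being published.
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        # No DISCONNECT seen: the will is published on the client's behalf.
        will = self.wills.pop(client_id, None)
        if will is not None:
            self.publish(*will)


TOPIC = "rtl_433/LWT"  # assumed topic name


def start(broker, client_id):
    broker.connect(client_id, TOPIC, "Offline")  # register the will
    broker.publish(TOPIC, "Online")              # birth message


def stop(broker, client_id):
    broker.publish(TOPIC, "Offline")  # explicit, because the will won't fire
    broker.disconnect(client_id)
```

So a status flag that covers both startup and shutdown needs the explicit `Offline` publish before disconnecting, with the will only covering crashes and network loss.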
Since then I've got all my batteries displayed in both Grafana and Home Assistant, I just need to set up notifications for low ones. I had one sensor offline for over two months because I wasn't paying attention!
HA has built in support for handling expiration of sensor values, so I might be able to detect issues with rtl_433 shutting down as well. So the need for this specific request is probably lessened, but I think it's a good thing to have. :smile: If it's available I'll use it.
> The problem I was having (am still?) was the USB issue ("Async read stalled, exiting")
About the "Async read stalled, exiting" error: it happens from time to time on my Pi (maybe once a week). I just have to reset the USB port to make it work again. I compiled usbreset.c (https://gist.github.com/x2q/5124616) and use "usbreset /dev/bus/usb/001/004" (assuming lsusb reports the dongle on bus 001, device 004). Sadly, rtl_433 does not exit with an error code on "Async read stalled". Probably something has to change near https://github.com/merbanan/rtl_433/blob/ed7a743b40c79c95130de8c79dc2175fa3a5cd5b/src/rtl_433.c#L1136 but I did not succeed in modifying the code to exit(42) in case of this error.
Good idea, we should do that: exit with a defined code on different errors. Note though that stall errors can have multiple causes. We just notice that the receiver does not deliver data anymore. For me (Raspberry Pi 3 Model B, 2x R820T2 rtl-sdr) that happens once a week because the dongle gets confused, or when running CPU-heavy tasks which might undervoltage the dongle. In my case no USB reset is needed.
exit codes are implemented, LWT discussed in #1547. closing.
I have been dealing with this issue (not with rtl_433) in home assistant. I have settled on publishing to
Then, I set online as an availability topic, so that the temperature goes unavailable when the online topic becomes OFF, or if temperature has not been written for usually 2 * interval + 30s. The payloads are arbitrary, but it's nice if they line up with either binary sensor or availability topic norms within HA. I set up alerts for 'sensor offline' and 'sensor unavailable'.
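As a rough illustration of that rule, the combined check could look like this in Python. The `OFF` payload and the `2 * interval + 30s` timeout come from the description above; the `ON` payload and the function names are assumptions.

```python
def expiry_seconds(interval):
    """Timeout after which a reading counts as missing: 2 * interval + 30 s."""
    return 2 * interval + 30


def is_available(online_payload, last_update, now, interval):
    """Available only if the online topic is ON and the reading is fresh."""
    fresh = (now - last_update) <= expiry_seconds(interval)
    return online_payload == "ON" and fresh


# A sensor reporting every 60 s is given 150 s before it goes unavailable.
print(expiry_seconds(60))
print(is_available("ON", last_update=1_000, now=1_100, interval=60))
```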
I do wonder how much of this should be in rtl_433, vs the mqtt relay script. But it's nice to know if rtl_433 crashes, which the current UDP-to-script would not notice. Overall I think this situation could benefit from a step back and systems design thinking; the world is complicated.
Add support for sending Birth and LWT messages. This would make it trivial to know when a rtl_433 is running or has gone offline.