merbanan / rtl_433

Program to decode radio transmissions from devices on the ISM bands (and other frequencies)
GNU General Public License v2.0

MQTT Birth and Last Will and Testament (LWT) messages #1394

Closed kpine closed 3 years ago

kpine commented 4 years ago

Add support for sending Birth and LWT messages. This would make it trivial to know when an rtl_433 instance is running or has gone offline.

zuckschwerdt commented 4 years ago

Good idea. Suggestions on the topic and content?

peterchs commented 4 years ago

For reference, Tasmota publishes these messages on a reboot:

tele/tasmota_D851A7/LWT Offline

tele/tasmota_D851A7/LWT Online

So perhaps:

On startup: rtl_433/$HOSTNAME_OR_ID/LWT Online

On termination/exit: rtl_433/$HOSTNAME_OR_ID/LWT Offline

kpine commented 4 years ago

I'm not sure if it's better to have a single topic with different payloads, or separate topics.

But, I would probably prefer the single topic with different payloads, something like <topic prefix>/status with payloads offline and online. That's an approach I've seen elsewhere [1] [2]. You could include other information besides the process status in a JSON payload, if desired.

I'm not using any sort of hostname or instance identifier in my topics, since I only have a single service running (maybe I should be). My MQTT config is:

output          mqtt://hostname:1883,retain=0,events=rtl_433/events,states=rtl_433/states,devices=rtl_433/[protocol][/channel:0][/id:0]

So I'd want to see the status at rtl_433/status or so. Not sure if that's possible to derive automatically, or if it would need to be a separate config option (e.g. ,status=rtl_433/status). Even better, but a different subject, would be able to allow a custom topic prefix.

Also, speaking of Tasmota, I came upon a GitHub issue related to this. We would probably want the offline status to be reported even in the case of a DISCONNECT.

zuckschwerdt commented 4 years ago

We could introduce prefix= to clean things up. Not sure if many people use all of events, states and devices. The events and devices carry the same information in different formats.

Perhaps prefix/LWT is better than prefix/status as there is already prefix/states and that is confusing.

We have three different status messages available:

We would aim to always disconnect cleanly, thus the abnormal connection lost case could get another, more serious message.

What would parse the message and what are the expectations there?

kpine commented 4 years ago

> The events and devices carry the same information in different formats.

Thanks for the reminder, I've removed it, it was a holdover from some previous testing.

> What would parse the message and what are the expectations there?

In my case, I was planning to use this with Home Assistant MQTT sensors, which have an "availability topic" and configurable payload. If rtl_433 would update this availability topic to online or offline, the sensors will become available or unavailable respectively, so it would be trivial to know when things are in a bad state.

After a little more research though, I'm not sure that's the correct approach, or at least it's not a complete approach. It would tell me that rtl_433 is down, which is important. It wouldn't tell me anything about individual sensors, like if a battery died and the sensor stopped reporting. There are other ways to handle that which I need to add anyways. So tying this to individual sensors may not be the right idea.

Assuming I wanted to use the availability topic approach anyways, HA would expect a single topic (I think it supports wildcards but I don't know how that would work, never seen it used) and two payloads, representing offline or online. For example, I think Tasmota does this:

topic: <prefix>/LWT, message Online or Offline.

Adding a third message, like Connection Lost, would not work. HA requires one available string and one unavailable string. This seems to be the most common behavior that I've seen so far with other implementations. For my use case the reason for being offline is not very useful anyway. I imagine it might be for someone, but I just need to know if the service is running or not.
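To make the two-payload expectation concrete, here is a minimal sketch of a Home Assistant MQTT sensor configuration built as a Python dict. The keys `state_topic`, `availability_topic`, `payload_available` and `payload_not_available` are Home Assistant's MQTT options; the topic names (`rtl_433/status` and the temperature topic) and the sensor name are assumptions for illustration only, not something rtl_433 publishes as of this discussion.

```python
import json


def availability_config(name, state_topic, status_topic):
    """Build an HA MQTT sensor config with one availability topic and two payloads."""
    return {
        "name": name,
        "state_topic": state_topic,
        # HA marks the sensor available/unavailable based on this single topic.
        "availability_topic": status_topic,
        "payload_available": "online",
        "payload_not_available": "offline",
    }


config = availability_config(
    "Outdoor temperature",
    "rtl_433/Example-Sensor/1/temperature_C",  # hypothetical device topic
    "rtl_433/status",                          # hypothetical status topic
)
print(json.dumps(config, indent=2))
```

A third payload such as "Connection Lost" has no slot in this structure, which is why a single topic with exactly two payloads fits best.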

If I toss out the idea of using the HA availability topic, then the above doesn't matter much and the solution could be just about anything:

Also, even if rtl_433 is configured with the retain option off, you'd want to always retain these messages regardless so clients that connect after an exit would know.

zuckschwerdt commented 4 years ago

Thanks for the detailed analysis! It could be reasonable to assume that the user knows how to interpret the "Offline". Either as "that happens, I'm just testing" or "Critical, something important failed", which is to say users that never want to see "Connection lost" also never want to see "Offline" and would act on both, right?

Aside: I watch my batteries by adding all "battery_ok" fields and graphing that to quickly gauge the state of things. In fact I log all "battery_ok" fields to a TSDB and let Grafana add them to a percentage gauge ;)

peterchs commented 4 years ago

What does "connection lost" refer to? Is that the connection to the USB device? If it's the connection to the MQTT broker, surely that message can't be published?

An Online/Offline LWT may be useful for me so I would know specifically which device was down, or to diagnose: pull the RTL USB dongle and it would give me an MQTT Offline, and I'd know that was device #X.

Re: low battery, another method is to use Node-RED with the existing MQTT messages. I have two patterns that I use for all my rtl_433 sensors: one triggers a notification/email if battery_ok is 0, and the other fires if no message has been received for X time (the sensor's MQTT topic feeds a trigger node that is reset every time a new message arrives; if nothing is received for X hours, the trigger sends a message to the email/notification nodes).
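The same two checks can be sketched outside Node-RED. This is only an illustration of the logic, assuming the last message time and `battery_ok` value are tracked per sensor; the function name, sensor names and timestamps are made up.

```python
def check_sensors(last_reports, now, max_silence_s):
    """Return alert strings for sensors with a low battery or that have gone quiet.

    last_reports maps sensor name -> (timestamp of last message, battery_ok value).
    """
    alerts = []
    for name, (seen_at, battery_ok) in sorted(last_reports.items()):
        # Pattern 1: the sensor itself reports a low battery.
        if battery_ok == 0:
            alerts.append(f"{name}: battery low")
        # Pattern 2: nothing received within the allowed silence window.
        if now - seen_at > max_silence_s:
            alerts.append(f"{name}: no message for {int(now - seen_at)} s")
    return alerts


reports = {
    "garden": (1000.0, 1),   # last heard long ago, battery fine
    "garage": (9000.0, 0),   # heard recently, battery low
    "attic": (9500.0, 1),    # healthy
}
alerts = check_sensors(reports, now=10000.0, max_silence_s=3600)
print(alerts)
```

Neither check depends on a status message from rtl_433, which is why this works with the MQTT output as it is today.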

zuckschwerdt commented 4 years ago

LWT is any connection loss without clean disconnect. This could be network problems or crash of any component. rtl_433 usually shuts down in a controlled manner, even on USB trouble. But this isn't currently flagged to the MQTT broker.
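To illustrate the distinction, here is a toy in-memory model of how a broker treats a will on a clean disconnect versus a lost connection. It is a sketch of the MQTT semantics only, not a real broker or client; `TinyBroker`, the client id and the `rtl_433/status` topic are hypothetical, and the will is assumed to be registered with its retain flag set.

```python
class TinyBroker:
    """Toy model of an MQTT broker's retained-message and will handling."""

    def __init__(self):
        self.retained = {}  # topic -> last retained payload
        self.wills = {}     # client id -> (topic, payload)

    def connect(self, client_id, will_topic, will_payload):
        # The will is registered at connect time and stored by the broker.
        self.wills[client_id] = (will_topic, will_payload)

    def publish(self, topic, payload, retain=False):
        if retain:
            self.retained[topic] = payload

    def disconnect(self, client_id):
        # Clean DISCONNECT: the broker discards the will without publishing it.
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        # Ungraceful loss: the broker publishes the will on the client's behalf.
        topic, payload = self.wills.pop(client_id)
        self.publish(topic, payload, retain=True)


STATUS = "rtl_433/status"  # hypothetical status topic
broker = TinyBroker()

# Case 1: crash or network loss after the birth message.
broker.connect("rtl_433", STATUS, "offline")
broker.publish(STATUS, "online", retain=True)
broker.connection_lost("rtl_433")
after_crash = broker.retained[STATUS]

# Case 2: clean exit that relies on the will alone.
broker.connect("rtl_433", STATUS, "offline")
broker.publish(STATUS, "online", retain=True)
broker.disconnect("rtl_433")
after_clean_exit = broker.retained[STATUS]

# Case 3: clean exit that publishes "offline" explicitly before disconnecting.
broker.connect("rtl_433", STATUS, "offline")
broker.publish(STATUS, "online", retain=True)
broker.publish(STATUS, "offline", retain=True)
broker.disconnect("rtl_433")
after_explicit_offline = broker.retained[STATUS]

print(after_crash, after_clean_exit, after_explicit_offline)
# prints: offline online offline
```

Case 2 is the gap: a controlled shutdown leaves a stale "online" behind unless the client publishes its own "offline" first.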

kpine commented 4 years ago

Actually, when I originally created this issue I didn't quite understand that LWT is technically for ungraceful disconnects only, not graceful ones. I thought it applied in both cases, mostly because other software I use acts this way. So that's how I phrased it, and the description is probably a bit narrow.

The problem I was having (am still?) was the USB issue ("Async read stalled, exiting") and a bug in Docker. rtl_433 would exit, and the container would hang, even with a restart policy enabled. Generally, just restarting rtl_433 has worked for me, I can't remember having to physically remove the USB device.

So the request is really for a general status flag that would be set on startup and shutdown. Whether the shutdown is graceful or not doesn't really matter, it seems. I think including LWT just makes sense.

Since then I've got all my batteries displayed in both Grafana and Home Assistant, I just need to set up notifications for low ones. I had one sensor offline for over two months because I wasn't paying attention!

HA has built in support for handling expiration of sensor values, so I might be able to detect issues with rtl_433 shutting down as well. So the need for this specific request is probably lessened, but I think it's a good thing to have. :smile: If it's available I'll use it.

psa-jforestier commented 4 years ago

> The problem I was having (am still?) was the USB issue ("Async read stalled, exiting")

About the "Async read stalled, exiting": it happens from time to time on my Pi (maybe once a week). I just have to reset the USB port to make it work again. I compiled usbreset.c (https://gist.github.com/x2q/5124616) and use "usbreset /dev/bus/usb/001/004" (assuming lsusb reports the dongle on bus 001, device 004). Sadly, rtl_433 does not exit with an error code when the "Async read stalled" problem occurs. Something probably has to change near https://github.com/merbanan/rtl_433/blob/ed7a743b40c79c95130de8c79dc2175fa3a5cd5b/src/rtl_433.c#L1136 but I did not succeed in modifying the code to exit(42) in case of this error.

zuckschwerdt commented 4 years ago

Good idea, we should do that: exit with a defined code on different errors. Note though that stall errors can have multiple causes. We just notice that the receiver does not deliver data anymore. For me (Raspberry Pi 3 Model B, 2x R820T2 rtl-sdr) that happens once a week, either because the dongle gets confused or when running CPU-heavy tasks, which might cause an undervoltage at the dongle. In my case no USB reset is needed.
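With defined exit codes, a wrapper script could decide what to do after the process ends. The sketch below assumes a stall is signalled by exit code 42, taken from the exit(42) idea above; that value and the function name are placeholders, not the codes rtl_433 actually uses. A stand-in command replaces the real receiver so the example runs anywhere.

```python
import subprocess
import sys

STALL_EXIT_CODE = 42  # hypothetical value, not rtl_433's actual exit code


def run_once(cmd):
    """Run the receiver command and classify how it ended."""
    result = subprocess.run(cmd)
    if result.returncode == 0:
        return "clean-exit"
    if result.returncode == STALL_EXIT_CODE:
        return "stalled"  # a caller could reset the USB port and restart here
    return "failed"


# Stand-in for the real receiver: a child process that exits with the stall code.
fake_stall = [sys.executable, "-c", f"import sys; sys.exit({STALL_EXIT_CODE})"]
print(run_once(fake_stall))
```

The same classification could also drive a status publish before the restart, so that subscribers see why the receiver went away.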

zuckschwerdt commented 3 years ago

Exit codes are implemented, LWT is discussed in #1547. Closing.

gdt commented 2 years ago

I have been dealing with this issue (not with rtl_433) in Home Assistant. I have settled on publishing to

Then, I set online as an availability topic, so that the temperature goes unavailable when the online topic becomes OFF, or if temperature has not been written for usually 2 * interval + 30s. The payloads are arbitrary, but it's nice if they line up with either binary sensor or availability topic norms within HA. I set up alerts for 'sensor offline' and 'sensor unavailable'.
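The availability rule described here can be written as a small function. This is a sketch of the stated rule only (unavailable when the online topic is OFF, or when nothing has been written for 2 * interval + 30 s); the function and parameter names are made up for illustration.

```python
def is_available(online_payload, last_write, now, interval_s):
    """Return False if flagged OFF or if the value is older than 2 * interval + 30 s."""
    if online_payload == "OFF":
        return False
    return now - last_write <= 2 * interval_s + 30


# With a 60 s reporting interval, the value expires after 150 s of silence.
print(is_available("ON", last_write=0, now=100, interval_s=60))   # True
print(is_available("ON", last_write=0, now=151, interval_s=60))   # False
print(is_available("OFF", last_write=0, now=10, interval_s=60))   # False
```

Allowing two intervals plus some slack tolerates a single missed transmission before the sensor is marked unavailable.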

I do wonder how much of this should be in rtl_433 versus the MQTT relay script. But it's nice to know if rtl_433 crashes, which the current UDP-to-script setup would not notice. Overall I think this situation could benefit from a step back and some systems design thinking; the world is complicated.