merbanan / rtl_433

Program to decode radio transmissions from devices on the ISM bands (and other frequencies)
GNU General Public License v2.0

RFC: Data sanity checking and filtering in rtl_433 #1964

Open rct opened 2 years ago

rct commented 2 years ago

For various reasons rtl_433 may output messages with bad data, such as humidity values that fall outside the range of 0 - 100. For most protocols, message integrity checks (MIC) based on checksums, CRCs, and/or parity bits should cause bad messages to be rejected. However, this doesn't always work: MICs are not 100% reliable. In some cases it is possibly a hardware error in the device itself; a sensor might be bad or have a temporary glitch, but the device calculates the MIC correctly over the bad data. (There have been cases identified in some decoders where a bitbuffer of all zeros could pass MIC checks such as a checksum. In a number of cases, those decoders have had checks added to prevent this.)
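As a toy illustration of why a weak integrity check can pass on corrupted data: with a simple additive checksum (many ISM-band protocols use checks of roughly this strength), an all-zero bitbuffer validates trivially. This is an illustrative Python sketch, not any specific rtl_433 decoder:

```python
def checksum_ok(msg: bytes) -> bool:
    """Toy additive checksum: the last byte must equal the sum of the
    preceding bytes modulo 256."""
    return msg[-1] == sum(msg[:-1]) % 256

# A legitimate message passes...
assert checksum_ok(bytes([0x12, 0x34, 0x46]))
# ...but so does a bitbuffer of all zeros: the sum of zeros is zero,
# which matches a zero checksum byte.
assert checksum_ok(bytes(8))
```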

This is a topic that at first glance seems somewhat trivial, but it isn't. Points of view may change after a year or more of operating a system with a variety of sensors. There is no one correct answer for all applications. I've been thinking about this for a while, but haven't tried to implement anything yet because I don't yet have a view of what the correct thing is.

The case for not filtering in rtl_433

This leads to the question of whether rtl_433 should do some sanity checking of data. An argument against doing sanity checking in rtl_433 is the architectural view that rtl_433 should do its best to handle the RF and lower layers of bit manipulation in a very general manner and then get out of the way, so as to be of use to a wide range of consumers.

Related to this is the architectural view that rtl_433 should not maintain any state for devices; that should be up to consumers, and state requirements might be not only consumer specific but also device specific.

Not doing any filtering/sanity checking makes sense when application-specific code reads from rtl_433 and can make its own decisions. For example:

rtl_433 -> application with filtering -> database/view applications

rtl_433's ability to implement sanity checking is limited by not keeping state

The architectural view that rtl_433 shouldn't maintain any state limits the types of sanity checking that can be done within rtl_433. Data that is within range but still potentially bad can't be filtered by rules that need history, for example rejecting values because the rate of change exceeds a threshold.
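To make this concrete, a rate-of-change rule necessarily carries per-device state. A minimal Python sketch (nothing like this exists in rtl_433; the names are made up):

```python
class RateOfChangeFilter:
    """Reject a reading if it differs from the last accepted reading by
    more than max_delta_per_s per elapsed second. The per-device state
    kept in self.last is exactly what rtl_433 avoids maintaining."""

    def __init__(self, max_delta_per_s):
        self.max_delta_per_s = max_delta_per_s
        self.last = None  # (timestamp, value) of the last accepted reading

    def accept(self, ts, value):
        if self.last is not None:
            t0, v0 = self.last
            dt = max(ts - t0, 1e-9)  # guard against a zero time step
            if abs(value - v0) / dt > self.max_delta_per_s:
                return False  # implausible jump; keep previous state
        self.last = (ts, value)
        return True

f = RateOfChangeFilter(max_delta_per_s=0.5)
assert f.accept(0, 20.0)       # first reading is always accepted
assert f.accept(30, 21.0)      # 1 degree over 30 s is plausible
assert not f.accept(60, 85.0)  # 64 degrees over 30 s is rejected
```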

Not keeping state in rtl_433 does have many benefits for a clean design. It helps keep rtl_433 general and usable for a wide range of use cases beyond just receiving weather and environmental data. An application that uses rtl_433 to receive data from security sensors has different requirements than a weather console.

Common use cases that need filtering (for lack of sanity checking in other applications)

However in the last few years, rtl_433 has added generalized publishing for messaging (MQTT) and storage (InfluxDB). A common use case is:

rtl_433 -> MQTT -> Home automation platform (Home Assistant). 

In addition to displaying data and acting on it, an automation platform like Home Assistant may be distributing the data to a SQL DB, a time-series DB like InfluxDB, and other message buses like MQTT. In the Home Assistant case, there is currently no capability for filtering or sanity checking incoming data in a general way. So any bad data winds up everywhere: displays, graphs, alerts, and automations are all impacted. It could be argued that it is consumers like Home Assistant and InfluxDB that need to change and implement some sanity checking. But the question is how practical that is versus changes in rtl_433.

It could be argued that adding filtering to rtl_433 adds complexity in a place where it doesn't belong. It could also be argued that rtl_433 possibly should not be publishing directly to messaging systems or long-term storage systems like time-series databases (TSDBs). The question is whether to be a large monolithic application or a smaller application that follows UNIX- or microservices-like philosophies.

There is a lot of value in keeping things simple. With Home Assistant, it's possible for people who don't want to get into the details to put together a somewhat off-the-shelf system with an RTL-SDR and get access to many RF ISM-band devices. There are containers for running rtl_433, an MQTT broker, and the rtl_433 MQTT auto discovery, without much command-line work. This has opened up rtl_433 to a much wider audience.

When not to filter

There are cases when filtering may not make sense or could be detrimental. Messages often contain more than one piece of data. Devices often contain more than one sensor, for example temperature and humidity. It's possible for one sensor to be bad but the other sensors to still be good.

If the humidity sensor on a weather station goes bad, should all data from the weather station be discarded until the humidity sensor is fixed?

(This is a case I've observed. humidity sensors for outdoor weather stations can become contaminated.)

If some sensors are giving bad data because the battery is low, would filtering prevent messages with the battery low indication from being seen?

(This is a case I've observed. I've had a device where the temperature goes to -40 and the humidity 0 when the battery is very low, but not fully dead yet.)

With filtering enabled, how difficult is it for a user to differentiate between reception problems and a bad sensor causing messages to be discarded?

Where to implement sanity checks and filtering?

Ad hoc checks in decoders

It's tempting to eliminate bad data by implementing ad hoc checks in decoders. Right now this is the only place to get any immediate relief from bad data.

To make it clear that messages are being rejected, there is usually a debug fprintf to stderr, but that is guarded by a verbose-mode check. So each of those checks is a multi-line block that violates the DRY principle, and the DRY problem will multiply as sanity checks are added to more devices.

Given the potential problems mentioned above in when not to filter, it's unlikely that all of the ad hoc checks can be easily enabled or disabled as needed without changing code.

As for the idea that all it takes to figure things out is to restart rtl_433 in verbose mode, consider users who are running rtl_433 in a container on some sort of embedded system like their home automation platform.

Finally, there is an architectural view that decoders should be fairly straightforward to write, maintain, and test. Implementing a lot of ad hoc checks works against that.

Data layer (aka data_make)

The data layer could be a useful place for controlling whether to filter or not filter, or even whether certain clients should get a filtered stream or a raw stream.

The data layer could also be used as a way of passing messages (or data elements) through but marking them as suspect.

Some data elements, like humidity, could be filtered without decoder-specific knowledge. However, something like temperature cannot: the range for an indoor sensor is very different from that of a cooking (oven, BBQ) sensor.

The data layer would need info from the decoders about valid ranges. So this would require a reasonably large amount of changes.
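To sketch the idea, here is what a range table fed by decoders might look like. The model name, field names, and limits are hypothetical illustrations, not actual rtl_433 structures:

```python
# Hypothetical range declarations that decoders would have to register;
# nothing like this table exists in rtl_433 today.
RANGES = {
    ("Example-Sensor", "temperature_C"): (-40.0, 70.0),
    ("Example-Sensor", "humidity"): (0, 100),
}

def out_of_range_fields(msg):
    """Return the fields of a decoded message that fall outside their
    declared range; fields without a declared range are never flagged."""
    bad = []
    for field, value in msg.items():
        limits = RANGES.get((msg.get("model"), field))
        if limits is not None and not limits[0] <= value <= limits[1]:
            bad.append(field)
    return bad

msg = {"model": "Example-Sensor", "temperature_C": 21.5, "humidity": 255}
assert out_of_range_fields(msg) == ["humidity"]
```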

A downstream "helper" application

A downstream helper application that reads from rtl_433, filters, and then distributes to some or most of the endpoints that rtl_433 already supports could solve a number of problems.

The examples directory is supposed to have things like this, but I don't think any of them are in wide enough use to really gain any traction.

Building a companion app for rtl_433 that handles filtering and distribution might help rtl_433 be less monolithic and might make things easier in the long run. Using an event-based framework that already has support for communicating with various types of data stores and messaging systems could make things easier and more reusable.
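As a sketch of the shape such a helper could take: rtl_433 with `-F json` emits one JSON object per line, so a filter can sit on a pipe. The humidity rule here is just an example policy:

```python
import json

def filter_stream(lines):
    """Yield only messages passing a simple plausibility check. Input is
    line-delimited JSON, as produced by `rtl_433 -F json`."""
    for line in lines:
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        hum = msg.get("humidity")
        if hum is not None and not 0 <= hum <= 100:
            continue  # drop implausible readings; a real tool should log this
        yield msg

# In a pipeline this would wrap sys.stdin, e.g.:
#   rtl_433 -F json | python filter.py | some_publisher
good, bad = '{"model": "X", "humidity": 55}', '{"model": "X", "humidity": 255}'
assert [m["humidity"] for m in filter_stream([good, bad, "garbage"])] == [55]
```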

Other considerations

Should the entire message be discarded for one bad data element?

There are cases outlined above where it would be useful to still get the message even though one data element is bad, namely one bad sensor on a multi-sensor device.

On the other hand, if it is a message integrity problem, one bad data element can be a helpful indication that other elements might also be invalid even if they pass range checks.

Discard or mark messages with bad data?

Discarding a message with bad data is helpful to simple consumers that accept everything. However this also prevents other consumers from seeing potentially useful information.

Marking messages with bad data would be more generally useful, but would require changes in consumers to discard messages that are marked as suspect. This would currently be difficult in, say, Home Assistant using the common rtl_433 -> MQTT -> Home Assistant topology.
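The marking alternative could be as simple as attaching a list of failed fields. The "suspect" key is invented here for illustration; consumers would need to agree on such a convention:

```python
def mark_suspect(msg, bad_fields):
    """Annotate instead of drop: keep all decoded data, but record which
    fields failed a plausibility check so consumers can decide."""
    out = dict(msg)
    if bad_fields:
        out["suspect"] = list(bad_fields)
    return out

marked = mark_suspect({"model": "X", "humidity": 255}, ["humidity"])
assert marked["humidity"] == 255          # data is preserved...
assert marked["suspect"] == ["humidity"]  # ...but flagged for consumers
```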

Determining why a device isn't being received or is periodically dropping out is challenging

At the time of this writing, tracking down why a device isn't being received, or worse, is only intermittently received, is laborious and requires a not insignificant amount of skill and deduction. As mentioned above, if whole messages are discarded, data problems can be difficult to distinguish from reception problems.

The developer approach might be to test one device at a time, running rtl_433 in verbose mode and possibly removing the antenna to help isolate the one nearby device under test.

However, end users such as those in the home automation use case may have many devices covering temperature, humidity, water detection, security, presence, energy, etc. Differentiating between an RF problem, a device problem, or a problem with the code presents more challenges in this case. The techniques a developer uses would be fairly disruptive: shutting down the home automation system for the length of the debugging, running things manually, relocating devices, removing the antenna, etc.

Conclusion

This is a request for comments and discussion. It would be good to collect not only opinions, but also operational experience that might help guide this. Thanks for making it this far.

Edit: Added Other considerations section.

zuckschwerdt commented 2 years ago

To add an example thought on the unix philosophy (but please don't let this distract from the topic here): I've been thinking for some time whether splitting out the hopping logic to an external controller would help. rtl_433 can switch band, rate, and options easily with e.g. the http api. The requirements for hopping logic (conditions, time, timeout) are currently somewhat distracting in the code and don't fit everybody's needs. On the other hand, it is a very nice feature to have ready to go.

qwindelzorf commented 2 years ago

One potential concern I have with pushing the task of sanity checking into a downstream helper application is that now we are introducing multiple places in the project where you need to have detailed knowledge of the devices being supported.

As an example, temperature sensors. A few common use cases are weather (never going to see >100C) and cooking (should see >100C regularly, but never >700C). But what if I want to use a high-temperature thermocouple to log data from my back yard forge (1200+C, but realistically never valid below 200C)? The filter application would have to know what the different sensors are capable of and what sort of values should be expected for each device. So now there is device-specific info in two places: inside rtl_433 in the decoder, and also in the helper/filter application. This would veer into violating the single-source-of-truth mentality.

Overall, I like the idea of pushing filtering into a downstream application. It seems like a good way to handle separation of concerns (decoding vs filtering). But we would also need to come up with a way to handle device specific knowledge/limitations/edge cases generically.

DAVe3283 commented 2 years ago

After thinking about it, I tend to agree that rtl_433 should only be doing integrity-type checks (parity, checksum, other ways of detecting an entire message is corrupt). Unfortunately, sometimes a bit flip in the wrong place can make one sensor look like another (specifically the AcuRite tower vs. leak detector), and the parity doesn't (appear to) cover the message type or sensor unique ID. Though maybe that bears further analysis back in #1953.

> Overall, I like the idea of pushing filtering into a downstream application. It seems like a good way to handle separation of concerns (decoding vs filtering). But we would also need to come up with a way to handle device specific knowledge/limitations/edge cases generically.

@qwindelzorf for our specific use case, we already implemented the sanity checks downstream in mqtt4hass with !22. That project needs some sensor-specific knowledge already, so not a huge burden to add the sanity ranges with the rest of the sensor details.

qwindelzorf commented 2 years ago

I would like to propose that the purpose of rtl_433, in this context, is "tell me what the sensor said". Where this gets murky is when it's not clear what the sensor said, either due to RF interference or ambiguous decoding. I'm not a fan of embedding data heuristics into the program (as is being done in #1953), but in that case it becomes necessary to be able to tell if the data is what we think it is.

Ideally, those sorts of rules could be put into some shared format (database, library, config file, whatever) and be used as-needed both in rtl_433 and in other downstream programs/projects. That way, there would be one source of truth, but all projects could use the appropriate parts. But I'm not sure what that would look like -- and it seems like a really big lift.

rct commented 2 years ago

> @qwindelzorf for our specific use case, we already implemented the sanity checks downstream in mqtt4hass with !22. That project needs some sensor-specific knowledge already, so not a huge burden to add the sanity ranges with the rest of the sensor details.

Mqtt4hass? This is the first I'm seeing of it; I need to take a look, though that's a separate topic. One of the things that has kept me from going any further with the MQTT auto discovery script is that it really needs both model-specific knowledge and possibly per-instance knowledge (name correctly before auto discovery for consistency across environments downstream from hass, like Influx).

zuckschwerdt commented 2 years ago

> the purpose of rtl_433, in this context is "tell me what the sensor said".

A clean rule that still allows for minor exceptions. We need some examples of what to expect and acceptable exceptions.

We only do the bare minimum of checks to ensure a good reception. We don't filter values on plausibility, as that might vary with the application. We do ensure that the bits are a sensible decode, but not that the values are a sensible representation.

E.g. a temperature can be reported as -40; a consumer can still store and display that, and visually at least it indicates an error state. But a raw temperature value of 0xfff could be caught as not a sensible coding and excluded, or (depending on confidence in the message integrity) the report aborted.

E.g. humidity will include 0, which might not be a plausible value, but we do block values > 100 if that helps to reduce false positives (add a code comment to that intention). With the exception that those limits need to be checked against the hardware, e.g. there is one sensor that reports 110 on range overflow; iirc we map that to 100 as a normalization.

E.g. BCD: when it's tested that the protocol does not flag some extra state with values >= 10, we might block those to reduce bad reception if the checksum is weak. An exception here is a sensor that reports DCF77 time data; we just "glue" all values together without further checks. We don't care what days are valid in a given month, and a seconds value of 60 is legal for a leap second.
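The humidity handling described above (pass 0-100, normalize a known overflow code, block the rest) could be written out as a sketch. The overflow value 110 is the one recalled in the comment; treat it as an assumption to verify against the hardware:

```python
def normalize_humidity(raw):
    """0-100 passes through; a known overflow code of 110 is clamped to
    100 as a normalization; anything else is treated as a bad decode
    and rejected by returning None."""
    if 0 <= raw <= 100:
        return raw
    if raw == 110:  # assumed range-overflow indication of one sensor model
        return 100
    return None

assert normalize_humidity(55) == 55
assert normalize_humidity(110) == 100
assert normalize_humidity(255) is None
```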

merbanan commented 2 years ago

> would like to propose that the purpose of rtl_433, in this context is "tell me what the sensor said".

That is sort of the current modus operandi.

Anyway, I would be open to adding an optional parameter verification step. For example, if a device is specified to support a specific temperature range, then filter out messages that return values out of that range. For example, there was a sensor (#1942) that returned bogus values at -0.1. The device could then still be useful in this case, as we could filter out the messages that have values out of range.

So the real change to the code would be a parameter verification block that is only active when the parameter option is active. The only thing it can do is reject messages that otherwise look intact. The question then is: should we reject the complete message even if only one parameter is invalid, as described by @rct? I think we should, because it is hard to present this granularly to the user. By default we verify the message structure; the proposed option would then broadly also verify the contents.
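A sketch of the proposed optional verification step: inactive by default, and rejecting the whole message if any declared parameter is out of range. The option name and range table are hypothetical:

```python
def message_ok(msg, ranges, verify_params=False):
    """Structural checks (MIC) are assumed to have already passed.
    With verify_params enabled, reject the complete message if any
    field with a declared range falls outside it."""
    if not verify_params:
        return True
    for field, (lo, hi) in ranges.items():
        value = msg.get(field)
        if value is not None and not lo <= value <= hi:
            return False
    return True

ranges = {"temperature_C": (-40.0, 60.0)}
assert message_ok({"temperature_C": 99.0}, ranges)  # option off: pass through
assert not message_ok({"temperature_C": 99.0}, ranges, verify_params=True)
assert message_ok({"temperature_C": 21.5}, ranges, verify_params=True)
```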

zuckschwerdt commented 2 years ago

> would be a parameter verification block that is only active when the parameter option is active.

That could be a plausibility option similar to verbosity, per decoder. And/or some quality field, like mic, that reports whether the message should be discarded.

merbanan commented 2 years ago

Ok, so then the decoder would always execute that code and populate a parameter that represents the parameter content integrity check. It could be a ranged parameter, and external code could filter out the message.

zuckschwerdt commented 2 years ago

Or also an option to set the minimum acceptable level. With maybe a setting to also include messages with failed mic?

merbanan commented 2 years ago

Maybe, if the signal-to-noise level is not too low.

qwindelzorf commented 2 years ago

> would like to propose that the purpose of rtl_433, in this context is "tell me what the sensor said".

> That is sort of the current modus operandi.

I mention that this is the primary goal because it seems like we should go ahead and adopt sanity checking, with the goal of using it to determine whether the data we're looking at could be coming from a given sensor. If I have a temperature sensor with an output range of 0-100C and I get a reading of 321C, odds are I have the wrong decoder. The decoder ought to be able to look at the data range and use that to determine data integrity.

Composite transmitters make this complicated. As mentioned, what do you do if you get a packet with two sensors and only one of them is insane? Maybe one sensor is fine and the other is busted? Maybe you have the wrong decoder? Maybe the message got stomped in transmission? In all those cases, I would say the message integrity is suspect, so we should at the very least mark the message as failing MIC. Whether or not to forward the partially decoded data should probably be left to the individual decoder authors, where the decoder logic and/or knowledge of the transmitter could drive that decision.

merbanan commented 2 years ago

One thing is the integrity of the structure of the message; another is the validity of the contained message parameters. If anything is done with regard to the message parameters, they should not report a failed mic, because that is not what it is. Almost any other error message is fine.

But a point is that it is the decoders that should handle this. One easy thing could be checking the temperature range against the manual.

zuckschwerdt commented 2 years ago

> If I have a temperature sensor with an output range of 0-100C, and I get a number that is 321C,

It could also indicate that we missed the sign flag, the sensor is actually -10 to 200, and marketing got it wrong in the manual/packaging. E.g. #1966

We likely also don't want to stop the user from discovering that something is wrong. It might be easier to spot or alarm on an out-of-range value than on missing values?

gdt commented 11 months ago

I think I land on

quaec commented 8 months ago

I understand both positions for allowing some kind of filtering and for not allowing it.

The issue that I've got is that rtl_433 is able to send received data via MQTT, but I can't filter it with another tool before it sends the data. If it were two programs, one for capturing and one for sending the data, then I could put something in between for filtering. But at the moment I would have to make a filtering tool that also sends the data via MQTT afterwards, and that's a bit unfortunate.

gdt commented 8 months ago

You can send data with another tool. Look at rtl_433_mqtt_relay in examples and send it json with -F and syslog.

The hard part is getting agreement on what/how, and the two-program approach lets you do what you want and then we'll have some experience.

I basically see built-in mqtt support as unfortunate, vs using an external sender. But that's me.

quaec commented 8 months ago

> You can send data with another tool. Look at rtl_433_mqtt_relay in examples and send it json with -F and syslog.

Thanks, that looks quite useful, I'll try it out.

> I basically see built-in mqtt support as unfortunate, vs using an external sender. But that's me.

Yeah, on one hand it's nice that MQTT is integrated; on the other hand it's not really flexible.