hdtodd / rtl_433_stats

Catalog and analyze devices recorded in rtl_433 JSON logs
BSD 2-Clause "Simplified" License

Add separation by ID #1

Closed gdt closed 1 year ago

gdt commented 2 years ago

Thank you for writing and publishing this. I have logs that are from mosquitto, where the payload is the json from rtl_433, so I should be able to run this.

I find that I sometimes see multiple instances of a model, and I suspect you might too. It would be an interesting enhancement to make the hash key rather than just model-type also include id and channel, when that's reasonably easy to do.

hdtodd commented 2 years ago

Greg, Thanks for the note ... it's rewarding to know that someone else might find the program useful -- and the information it provides is interesting.

I considered using a hash key (to speed up the catalog lookup/insert step) but in the end thought it might not be all that helpful. I hadn't considered it as a way to uniquely identify devices. I'll think a bit more about how that might be used to identify devices, but there seem to be so many variations that no single system seems likely to work.

For example, my Acurite 609TXC remote sensor generates a new ID whenever it is power-cycled: that's not a frequent event, but it means I couldn't identify the sensor as a single device across a year's set of JSON logs. And the JSON logs don't provide a channel number. Similarly, the Markisol remote-control shades in our neighborhood appear to transmit a function code in what rtl_433 considers to be the ID field; there are similar issues with the Oil-SonicStd remote sensor (where it appears the 'ID' field is really used for a sensor reading).

I've attached a summary of devices recently seen by my rtl_433 setup. I don't see a way to consistently catalog devices, uniquely identifying them, based on some combination of model+id+channel. I'll think about this a bit more, but if you see something obvious that would work, please let me know and I'll give it a shot.

By the way, it's pretty easy to prototype a new idea in the Python code, so if you want to try something out, you might want to start there.

Thanks, again, for the note.

David

Attachment: model-id-variations.txt

gdt commented 2 years ago

There are multiple things going on and I think you can only address some in your program. Your program can also be a tool to show data that leads to fixes in rtl_433 for others.

device issues

I see a bunch of issues in devices themselves:

comments

Skimming your list, I see things which are pretty clearly separate devices (TPMS) and some things where id is probably not id.

I don't think you should worry that a device that resets is treated as one instance before and one after. That is really how it is. But in any 24h to 1w dataset, this is usually not a big deal.

To me, the big point is not to merge things together, for counting and especially for SNR stats, that are not the same emitter. Of course, I mean "when it is reasonably feasible not to". So things w/o ids, or which have rolling codes and no ids (car keyfobs) need to be aggregated for stats to make any sense (but really stats cannot make sense for such devices).

Also, as an aside, I see the point of speed, but given that this is experimental and that there are lots of things to change, I would suggest doing python only, because the way it is now means that a proposed change has to be mirrored in both, or there would be behavior divergence.

I didn't really mean hash for speed so much as I meant "construct the unique key by which this device is identified as an individual" for semantics. However, it seems for implementation a dict is the way to go and that's a hash underneath, perhaps of a model-type-id tuple. You can also work around bad ids by skipping them based on model-type, pending fixing rtl_433.
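A minimal Python sketch of that idea, for illustration only and not taken from the project code: it counts records in a dict keyed on a (model, id) tuple and drops the id for models where it is believed not to be a real identifier. The function names and the contents of `UNRELIABLE_ID_MODELS` are hypothetical; the two model names are just the examples mentioned earlier in this thread. It assumes one JSON object per input line with `model` and `id` fields.

```python
import json
from collections import defaultdict

# Hypothetical skip list: models whose "id" field is suspected not to be
# a stable device identifier (examples taken from this discussion).
UNRELIABLE_ID_MODELS = {"Oil-SonicStd", "Markisol"}


def device_key(record):
    """Build the dict key that identifies one emitter."""
    model = record.get("model", "unknown")
    if model in UNRELIABLE_ID_MODELS:
        # Aggregate all records of this model under one key.
        return (model, None)
    return (model, record.get("id"))


def count_devices(lines):
    """Count records per device key, ignoring lines that are not JSON objects."""
    counts = defaultdict(int)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        counts[device_key(record)] += 1
    return dict(counts)
```

Once rtl_433 is fixed for a given model, removing it from the skip list would be the only change needed.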

hdtodd commented 2 years ago

Greg,

> I don't think you should worry that a device that resets is treated as one instance before and one after. That is really how it is. But in any 24h to 1w dataset, this is usually not a big deal.

Yes, that's where I settled out, too. Can't solve all the problems, particularly when there is no indication from the signaling device that power has cycled and/or what its old and new ids are.

> To me, the big point is not to merge things together, for counting and especially for SNR stats, that are not the same emitter.

When I started this, I was using the Acurite 609TXC (which I own) as a model, and the ID field identified individual devices more-or-less uniquely, modulo 256, until an individual device is power-cycled. But once I had the program running and started cataloging devices, it became clear that wasn't going to work in every case. And rather than consolidate differing devices and misrepresent them as one device, I left it with model+ID as the key. In some cases (e.g., Acurite and maybe Oregon-SL109H), that works. In other cases (e.g., Oil-SonicStd) it probably doesn't. Individual users will need to look at the SNR values and decide if it looks like the set of the "models" with different "IDs" are really likely the same device for the devices they see in their neighborhood.

So the catalog generated by snr is helpful but still requires some thought. For example, there are clearly two LaCrosse-TX141THBv2's in my neighborhood:

LaCrosse-TX141THBv2 141      7558   12.4 ±  3.3    5.1   20.0
LaCrosse-TX141THBv2 168      4287    7.6 ±  1.1    4.9   19.9

so using model+ID worked to distinguish them. On the other hand, there is probably only one, or at most two, Hyundai-VDO's in my neighborhood:

Hyundai-VDO 3f763f07            2   10.4 ±  5.1    6.8   14.0
Hyundai-VDO 3f763f15            1    6.8 ±  0.0    6.8    6.8
Hyundai-VDO acb915e9            2    8.0 ±  1.2    7.2    8.9
Hyundai-VDO acf5cd47            1    9.6 ±  0.0    9.6    9.6
Hyundai-VDO ad070853            1    8.9 ±  0.0    8.9    8.9

I don't think a program could really distinguish based on the information available.
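For anyone who wants to experiment, per-key summaries like the ones above could be computed roughly as follows. This is a sketch under assumptions, not the snr program's actual code: it assumes each record is a dict with `model`, `id`, and an `snr` field, and it reads the columns above as count, mean ± standard deviation, minimum, and maximum. The function name `snr_summary` is hypothetical, and the population standard deviation is used here simply so that a single sample yields 0.0.

```python
import statistics
from collections import defaultdict


def snr_summary(records):
    """Summarize SNR per (model, id) key: count, mean, std, min, max."""
    samples = defaultdict(list)
    for rec in records:
        if "snr" not in rec:
            continue  # skip records without a signal-level reading
        key = (rec.get("model"), rec.get("id"))
        samples[key].append(float(rec["snr"]))

    summary = {}
    for key, vals in samples.items():
        summary[key] = {
            "count": len(vals),
            "mean": statistics.mean(vals),
            # Population std dev: defined (0.0) even for one sample.
            "std": statistics.pstdev(vals),
            "min": min(vals),
            "max": max(vals),
        }
    return summary
```

Comparing the mean and spread across keys of the same model is then the manual step: keys with very similar statistics may be one device, but that remains a judgment call.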

> Also, as an aside, I see the point of speed, but given that this is experimental and that there are lots of things to change, I would suggest doing python only, because the way it is now means that a proposed change has to be mirrored in both, or there would be behavior divergence.

I'll probably do prototyping in Python, but update the C code when I find an approach that works. I learned Python over the last few months to do this and a couple of other projects, but my primary (recent) language is C and I'm a lot more comfortable with that for production. (My first programming language was Illiac II assembler.)

> I didn't really mean hash for speed so much as I meant "construct the unique key by which this device is identified as an individual" for semantics. However, it seems for implementation a dict is the way to go and that's a hash underneath, perhaps of a model-type-id tuple. You can also work around bad ids by skipping them based on model-type, pending fixing rtl_433.

The Python dict approach was pretty slick and made that part of the implementation pretty easy. In the C code, I'll likely just stick with a binary tree for catalog storage, though I may implement an AVL tree just for the fun of it. The table is so small that even a linear search would work and not take much more time.

As you point out, the important part is to figure out what combination of data available in the incoming message can be used to uniquely identify each specific device, and so far, I think the answer is that there is no suitable hashed key that would work across all devices. I don't think that's going to change. Manufacturers are trying to cram as much data as possible into packets of 40-80 bits, including model information, battery status, and checksums. There's no standard for what those packets should look like, and manufacturers would likely not be willing to compromise on a standard if it meant having to throw away some data.

An ugly approach would be to try to have the snr programs create a different hash key for each individual device, based on the device model information. I'm not willing to pursue that, and it would take detailed knowledge across all the devices to build it. So that would be a feasible but unlikely solution.

But the more eyes that look at the incoming data in catalogs like snr produces, the more likely that someone will figure out a combination of fields that would work. And the good news is that it's easy to adapt the code if we do figure it out.

Thanks again for prompting me to think a bit more about this and document what I think I know.

David

gdt commented 2 years ago

I think the Hyundai are different. Those look like TPMS ids and a car will have 4, maybe 5 of those. Maybe that's one car and a stray. For example, I hear 8 Ford ids, 4 for summer tires and 4 for snows.

I agree a program can't distinguish. I meant that the human can figure it out and if rtl_433 puts something in the id field that isn't an id, that's a bug and we should just fix it. When you find that please file an issue.

I know there's no standard format. But rtl_433 separates things out into named fields, and we have a convention that IDs go in the id field. We should fix the code if that's not the case either way. I'm simply suggesting that you add the id field to the key if it exists. And if that turns out not to be good, I say it's a bug in rtl_433, but we can have that discussion when it happens if it isn't obvious.

hdtodd commented 2 years ago

Greg,

The key used is the concatenation of "model" and "id". See line 91 of the C code, for example.

I'll comment it in the code so that it's easy to find and change if anyone wants to experiment.

David

hdtodd commented 1 year ago

Greg,

SNR v2.0.0 branch, now uploaded and renamed to rtl_433_stats, uses model/channel/id as the key now, aligning with other rtl_433 codes. The new version also adds more stats and fixes some de-dup issues, among other things. Check the v2.0.0 README for details.
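As a rough illustration of a model/channel/id key (a sketch only, not the v2.0.0 implementation): the function name `catalog_key`, the `/` separator, and the choice to leave an empty slot for a missing field are all assumptions made for this example.

```python
def catalog_key(record):
    """Join model, channel, and id into one string key.

    A missing channel or id leaves an empty slot, so records with and
    without that field never collide on the same key.
    """
    parts = [str(record.get("model", "unknown"))]
    for field in ("channel", "id"):
        # Test membership rather than truthiness so channel 0 is kept.
        parts.append(str(record[field]) if field in record else "")
    return "/".join(parts)
```

For example, a record with model `LaCrosse-TX141THBv2`, channel 0, and id 141 maps to `LaCrosse-TX141THBv2/0/141`, while a record with no channel keeps an empty middle slot.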