INSTEON: duplicated messages are processed

hollie / misterhouse

Perl open source home automation program. It's fun, it's free, and it's entirely geeky.

243 stars 130 forks source link

INSTEON: duplicated messages are processed #169

Closed peloy closed 11 years ago

peloy commented 11 years ago

After pressing once a button of a KeypadLinc, MH sees what seem to be multiple duplicate messages from the KeypadLinc, and reacts (sets state) for each received message. The log excerpt below shows the issue; see how set(off) is being called 3 times.

I don't know the protocol well enough but it seems to me like we should look at the hopsleft field and inter-message arrival times to try to determine if a message received from a specific device is a duplicate that reached us through some other devices in the mesh network that just repeated (relayed) the original message from the sending message. The discussion with subject "INSTEON mesh network" in the misterhouse-users mailing list has good analysis by Chris Dragon on this topic.

Avoiding setting state unless the new state is different is probably not a good solution because KeypadLinc buttons can be configured for toggle mode (send 'on' one time the button is pressed, then send 'off' the next time), 'always off' mode (send 'off' every time the button is pressed), and 'always on' mode (send 'on' every time the button is pressed). So, if we do not set the state unless the state changes then it will not be possible to use the 'always on' and 'always off' modes.

04/29/13 10:10:45 AM [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops. 04/29/13 10:10:45 AM [Insteon::BaseInterface] Received message from: $kpl_garage_button_a; command: off; type: alllink; group: 03 04/29/13 10:10:45 AM [Insteon::BaseObject] $kpl_garage_button_a::set(off, $kpl_garage_button_a)

04/29/13 10:10:45 AM [Insteon::BaseInterface] DEBUG2: Message received with 1 hops left, delaying next transmit to avoid collisions with remaining hops. 04/29/13 10:10:45 AM [Insteon::BaseInterface] Received message from: $kpl_garage_button_a; command: off; type: alllink; group: 03 04/29/13 10:10:45 AM [Insteon::BaseObject] $kpl_garage_button_a::set(off, $kpl_garage_button_a)

04/29/13 10:10:46 AM [Insteon::BaseInterface] Received message from: $kpl_garage_button_a; command: off; type: alllink; group: 03 04/29/13 10:10:46 AM [Insteon::BaseObject] $kpl_garage_button_a::set(off, $kpl_garage_button_a)

04/29/13 10:10:46 AM [Insteon::BaseInterface] Received message from: $kpl_garage_button_a; command: off; type: cleanup; group: 03 04/29/13 10:10:46 AM [Insteon::BaseObject] Ignoring Received Direct AllLink Cleanup Message for $kpl_garage_button_a since AllLink Broadcast Message was Received.

04/29/13 10:10:46 AM [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops. 04/29/13 10:10:46 AM [Insteon::BaseInterface] Received message from: $kpl_garage_button_a; command: link_cleanup_report; type: alllink; group: 03 04/29/13 10:10:46 AM [Insteon::BaseObject] $kpl_garage_button_a::set(link_cleanup_report, $kpl_garage_button_a)

I am experiencing this problem with a very recent (i2CS, probably very recent firmware as of April 2013) dual-band KeypadLinc. I do not see this problem with ICON switchs in my network. RemoteLincs and motion sensors apparently send the same message twice in short succession, but there is code in the stack to prevent re-setting the state in these situations, as the following log shows:

04/29/13 11:11:20 AM [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops. 04/29/13 11:11:20 AM [Insteon::BaseInterface] Received message from: $remote_button_4; command: on; type: alllink; group: 04 04/29/13 11:11:20 AM [Insteon::RemoteLinc] $remote_button_4::set_receive(on, $remote_button_4)

04/29/13 11:11:20 AM [Insteon::BaseInterface] Received message from: $remote_button_4; command: on; type: cleanup; group: 04 04/29/13 11:11:20 AM [Insteon::BaseObject] Ignoring Received Direct AllLink Cleanup Message for $remote_button_4 since AllLink Broadcast Message was Received.

dracoventions commented 11 years ago

The longish discussion around this bug is here: http://misterhouse.10964.n7.nabble.com/INSTEON-mesh-network-tt17890.html

After I posted my last comment there I started wondering, why don't I see two duplicate messages on my network? Everything is sent with max hops 1 to start with so I should see one duplicate message when the message is repeated.

I just did a test and got this:

04/29/2013 08:36:31 AM [Insteon_PLM] DEBUG3: Received PLM raw data: 0250211195000001cb1300
04/29/2013 08:36:31 AM [Insteon_PLM] DEBUG4: Milliseconds 690.85
         PLM Command: (0250) insteon_received
            From Address: 21:11:95
              To Address: 00:00:01
           Message Flags: cb
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 2
                    Max Hops: 3
         Insteon Message: 1300
                       Cmd 1: (13) ALL-Link Alias 1 Low
                       Cmd 2: 00

04/29/2013 08:36:31 AM [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops. 04/29/2013 08:36:31 AM [Insteon::BaseInterface] Received message from: $sl_StairsTop; command: off; type: alllink; group: 01 04/29/2013 08:36:31 AM [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $sl_StairsTop 04/29/2013 08:36:31 AM [Insteon::BaseObject] $sl_StairsTop::set(off, $sl_StairsTop)

Very odd that the first message I see has max hops 3 and that's the only time I see that message. It really seems like my PLM is filtering out retransmissions, so maybe Eloy's PLM is defective after all, or maybe some PLM models don't filter dupes? Mine is a 2413U v1.A dated late 2012 with firmware 9b.

Eloy, what is your PLM model/date/firmware? You should see if there is newer firmware that can be installed on it.

peloy commented 11 years ago

Hi Chris,

On 04/29/2013 11:44 AM, cdragon wrote:

04/29/2013 08:36:31 AM [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops. 04/29/2013 08:36:31 AM [Insteon::BaseInterface] Received message from: $sl_StairsTop; command: off; type: alllink; group: 01 04/29/2013 08:36:31 AM [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $sl_StairsTop 04/29/2013 08:36:31 AM [Insteon::BaseObject] $sl_StairsTop::set(off, $sl_StairsTop)

Very odd that the first message I see has max hops 3 and that's the only time I see that message. It really seems like my PLM is filtering out retransmissions, so maybe Eloy's PLM is defective after all, or maybe some PLM models don't filter dupes? Mine is a 2413U v1.A dated late 2012 with firmware 9b.

Eloy, what is your PLM model/date/firmware? You should see if there is newer firmware that can be installed on it.

My PLM is a 2413U bought in July 2010. Firmware version 92. I would upgrade to a new version but as far as I know the firmware of all SmartHome INSTEON product is not field upgradeable, unfortunately.

Note that I do not see (apparently) duplicate messages for my single-band (powerline) devices... is your sl_StairsTop a dual-band device? I am seeing the issue with a dual-band KeypadLinc. I also have an Access Point in my network, so that's another device that can be relaying messages.

Cheers,

Eloy Paris.-

dracoventions commented 11 years ago

The Switchlinc above that isn't showing duplicate messages at the PLM is a late 2012 dual-band switchlinc. Unfortunately I don't have any dual-band kpls to see if they exhibit the behavior you're seeing.

SmartLinc 2.0 does have field-upgradable firmware, but I'm not sure about the USB PLMs. You should search the internets to see! =) Note that I didn't see duplicate messages when using the SmartLinc either (also firmware 9b).

krkeegan commented 11 years ago

My two cents and a few corrections.

The messages marked as "alllink" are broadcast messages, so there is no ACK for these messages.
Broadcast messages are always sent with a hop count of 3 no matter how capable your network is.
In contrast, direct messages are normally sent with the fewest number of hop_count necessary for the message to be received, this generally prevents duplicate direct messages (but see hop_count comment below).
Up until very recently, MH ignored these messages and waited for the direct message that follows this message. Possibly by adding this feature, we have caused a bug that has otherwise been suppressed. **
To my knowledge, PLMs do not filter out "duplicate" messages. Duplicate messages are very common in direct messages when the hop_count is set too high. I think it is unlikely that the PLM would filter out duplicate broadcast messages by not direct messages.

\ I suspect that this is the real cause of Eloy's problems. I will look at my network tonight and I imagine I will see the same thing.

dracoventions commented 11 years ago

But even with no ACK, the broadcast should be sent a total of 4 times with max hops 3. Page 23 example 3 shows that happening. Even if the message was somehow not being sent 4 times, the fact that max hops is 3 and remaining is 2 indicates it was repeated at least once, but the PLM never sends MH the max hops 3, remaining 3 copy of the message. The PLM has got to be getting that copy because it's transmitted wirelessly from a dual-band switchlinc to a dual-band PLM 15-20 feet away with nothing but air between them.

And even if that particular case can somehow be explained, I've never seen a duplicate message received in all the logs I've looked at, so something is filtering them.

krkeegan commented 11 years ago

Going through my log this morning I found these:

29/04/2013 08:36:58  [Insteon_PLM] DEBUG3: Received PLM raw data: 025014360f000001c71101
29/04/2013 08:36:58  [Insteon_PLM] DEBUG4: Milliseconds 104.52
         PLM Command: (0250) insteon_received
            From Address: 14:36:0f
              To Address: 00:00:01
           Message Flags: c7
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 1
                    Max Hops: 3
         Insteon Message: 1101
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 01

29/04/2013 08:36:58  [Insteon::BaseInterface] DEBUG2: Message received with 1 hops left, delaying next transmit to avoid collisions with remaining hops.
29/04/2013 08:36:58  [Insteon::BaseInterface] Received message from: $g_st_mt; command: on; type: alllink; group: 01
29/04/2013 08:36:58  [Insteon::BaseObject] DEBUG3: Adding hop count of 2 to hop_array of $g_st_mt
29/04/2013 08:36:58  [Insteon_PLM] DEBUG3: Received PLM raw data: 025014360f000001c31101
29/04/2013 08:36:58  [Insteon_PLM] DEBUG4: Milliseconds 368.54
         PLM Command: (0250) insteon_received
            From Address: 14:36:0f
              To Address: 00:00:01
           Message Flags: c3
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 3
         Insteon Message: 1101
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 01

29/04/2013 08:36:58  [Insteon::BaseInterface] Received message from: $g_st_mt; command: on; type: alllink; group: 01
29/04/2013 08:36:58  [Insteon::BaseObject] DEBUG3: Adding hop count of 3 to hop_array of $g_st_mt

27/03/2013 19:20:29  [Insteon_PLM] DEBUG3: Received PLM raw data: 025014360f000001c71101
27/03/2013 19:20:29  [Insteon_PLM] DEBUG4: Milliseconds 915.75
         PLM Command: (0250) insteon_received
            From Address: 14:36:0f
              To Address: 00:00:01
           Message Flags: c7
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 1
                    Max Hops: 3
         Insteon Message: 1101
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 01

27/03/2013 19:20:29  [Insteon::BaseInterface] DEBUG2: Message received with 1 hops left, delaying next transmit to avoid collisions with remaining hops.
27/03/2013 19:20:29  [Insteon::BaseInterface] Received message from: $g_st_mt; command: on; type: alllink; group: 01
27/03/2013 19:20:30  [Insteon_PLM] DEBUG3: Received PLM raw data: 025014360f000001c71101
27/03/2013 19:20:30  [Insteon_PLM] DEBUG4: Milliseconds 179.03
         PLM Command: (0250) insteon_received
            From Address: 14:36:0f
              To Address: 00:00:01
           Message Flags: c7
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 1
                    Max Hops: 3
         Insteon Message: 1101
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 01

27/03/2013 19:20:30  [Insteon::BaseInterface] DEBUG2: Message received with 1 hops left, delaying next transmit to avoid collisions with remaining hops.
27/03/2013 19:20:30  [Insteon::BaseInterface] Received message from: $g_st_mt; command: on; type: alllink; group: 01
27/03/2013 19:20:30  [Insteon::BaseObject] DEBUG3: Adding hop count of 2 to hop_array of $g_st_mt

27/03/2013 22:02:21  [Insteon_PLM] DEBUG3: Received PLM raw data: 02501cb4a0000001cb1100
27/03/2013 22:02:21  [Insteon_PLM] DEBUG4: Milliseconds 647.56
         PLM Command: (0250) insteon_received
            From Address: 1c:b4:a0
              To Address: 00:00:01
           Message Flags: cb
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 2
                    Max Hops: 3
         Insteon Message: 1100
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 00

27/03/2013 22:02:21  [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops.
27/03/2013 22:02:21  [Insteon::BaseInterface] Received message from: $s_ma_room_lt_ma; command: on; type: alllink; group: 01
27/03/2013 22:02:21  [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $s_ma_room_lt_ma
27/03/2013 22:02:21  [Insteon_PLM] DEBUG3: Received PLM raw data: 02501cb4a0000001c31100
27/03/2013 22:02:21  [Insteon_PLM] DEBUG4: Milliseconds 909.96
         PLM Command: (0250) insteon_received
            From Address: 1c:b4:a0
              To Address: 00:00:01
           Message Flags: c3
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 3
         Insteon Message: 1100
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 00

27/03/2013 22:02:21  [Insteon::BaseInterface] Received message from: $s_ma_room_lt_ma; command: on; type: alllink; group: 01
27/03/2013 22:02:21  [Insteon::BaseObject] DEBUG3: Adding hop count of 3 to hop_array of $s_ma_room_lt_ma

28/03/2013 07:56:12  [Insteon_PLM] DEBUG3: Received PLM raw data: 02501c9deb000001cb1100
28/03/2013 07:56:12  [Insteon_PLM] DEBUG4: Milliseconds 698.35
         PLM Command: (0250) insteon_received
            From Address: 1c:9d:eb
              To Address: 00:00:01
           Message Flags: cb
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 2
                    Max Hops: 3
         Insteon Message: 1100
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 00

28/03/2013 07:56:12  [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops.
28/03/2013 07:56:12  [Insteon::BaseInterface] Received message from: $f_al_lt_ma; command: on; type: alllink; group: 01
28/03/2013 07:56:12  [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $f_al_lt_ma
28/03/2013 07:56:13  [Insteon_PLM] DEBUG3: Received PLM raw data: 02501c9deb000001c31100
28/03/2013 07:56:13  [Insteon_PLM] DEBUG4: Milliseconds 014.45
         PLM Command: (0250) insteon_received
            From Address: 1c:9d:eb
              To Address: 00:00:01
           Message Flags: c3
                Message Type: (110) All-Link Broadcast Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 3
         Insteon Message: 1100
                       Cmd 1: (11) ALL-Link Recall
                       Cmd 2: 00

28/03/2013 07:56:13  [Insteon::BaseInterface] Received message from: $f_al_lt_ma; command: on; type: alllink; group: 01
28/03/2013 07:56:13  [Insteon::BaseObject] DEBUG3: Adding hop count of 3 to hop_array of $f_al_lt_ma

I could probably find more. Although I appear to have fewer than Eloy does.

That said, I do have a dual band device right next to the PLM, and I am unable to find duplicate messages from that device in my log. I can hammer it tonight and see if I can get any.

My PLM is also rather new firmware = 9B.

Moreover, like I said, I know I had a tremendous problem with duplicate messages when I worked on the hop_count problem. They would often cause errors in the ALDB link database stored by MH. These would result in bytes being repeated twice, for example an address 20.f5.10 would appear as 20.f5.f5. Many people reported seeing these as well, so I am relatively confident duplicate messages were a wide spread problem.

The PLM may have some filtering capability, but if it does, it isn't perfect.

dracoventions commented 11 years ago

I think you're right that the PLM tries to filter duplicates but doesn't always succeed. Even in your examples that show duplicates, there's a number of duplicates that got filtered, like where you see Hops left: 2 then Hops left: 0. The Hops left: 1 message isn't reported to MH but the PLM must have received it since all devices that have heard the message in the past will repeat the message again in each time slot with hops left - 1 until they reach hops left 0.

Your second example is extremely strange since both copies of the message have hops of 3/1. Since the messages are potentially one second apart according to the time stamp, the second one would likely be a completely new broadcast of the same message starting with hops 3/3, 3/2, and the PLM only reports the 3/1 version. But since it's a broadcast message, the original device shouldn't have reason to transmit it again unless maybe the button on the device was double-pressed quickly.

krkeegan commented 11 years ago

It looks like the Insteon White Paper was updated recently, it may only be a cosmetic thing, i certainly see a few new diagrams. Maybe a more astute observer can determine what is new: http://www.insteon.com/pdf/insteondetails.pdf

As to multiple messages arriving with the same hop counts. We have seen this before in working on the ALDB scan. Our assumption is that the bridge between the wired and rf signal is the cause of this. It is believed that at least one of the directions wired->rf or rf->wired does not cause a decrease in the hop count.

Additionally, Insteon relies on the powerline carrier frequency for message timing and synchronization. Since most homes have two-phase electricity, rf repeating and phase bridging creates the very real possibility of two different message paths forming.

UPDATE: The likelihood for duplicate messages paths increases as the hop count increases.

My only concern with adding a filter function, is that the message timing in MH has never matched what Insteon claims it should be. By this I mean that the Insteon whitepaper quotes timings for how long it may take to receive an acknowledgement based on the number of hop counts. Unfortunately, this doesn't match real world experience. Possibly a delay in the PLM, the serial connection, MH, or in the end device itself causes some level of variability in how long it takes for messages to be acknowledged. This can be upwards of 3 or more seconds.

peloy commented 11 years ago

On 04/30/2013 02:55 PM, krkeegan wrote:

[...]

My only concern with adding a filter function, is that the message timing in MH has never matched what Insteon claims it should be. By this I mean that the Insteon whitepaper quotes timings for how long it may take to receive an acknowledgement based on the number of hop counts. Unfortunately, this doesn't match real world experience. Possibly a delay in the PLM, the serial connection, MH, or in the end device itself causes some level of variability in how long it takes for messages to be acknowledged. This can be upwards of 3 or more seconds.

Perhaps this is the reason the PLM cannot successfully filter all duplicate messages? I know -- without knowing exactly what the PLM does (whether it filters, and how), it's hard to know.

Anyway, if trying to follow the theoretical INSTEON timing as a means to detect duplicates is too unreliable/would not work in practice, what other means do we have to detect duplicates? Two packets might be the same (and therefore one a duplicate of the other) if:

Same From address
Same To address
Same message structure (type, flags, etc.)
Same Max Hops
Same Cmd1 and Cmd2 (both standard and extended)
Same User Data 1 .. User Data 14 (extended)

Of course, it's possible for a device to generate two identical messages in short succession (if fact, I believe some devices like motion sensors and RemoteLincs do that precisely), so that's another case to take into consideration (we already do for motion sensors and RemoteLincs, i.e. "set_receive(off, $upstairs_motion) deferred due to repeat within 1 second").

It'd be a lot easier if the protocol designers had included an ID, similar to the IP ID field, or TCP sequence numbers, but alas, we don't have that...

Now for the differentiators... the number of hops left, of course, and timing differences. Is there anything else?

If we treat as a duplicate a message that meets criteria (1) through (6) above, that has a certain hops left value (I don't know the exact criteria for this value, but Chris has made an analysis), and that was received within a short time window after receiving another, would that be too crude of an approach to detect duplicate messages?

Cheers,

Eloy Paris.-

P.S. Sorry, all this is probably obvious to the INSTEON experts here; I was just brainstorming aloud ;-) But feel free to keep me honest if I said anything that is not accurate.

dracoventions commented 11 years ago

What if you put in a configuration param that defaults to 1000ms and have MH ignore messages that are identical in all values except the hops remaining and the D14 CRC byte that appear within that timeout value? For the rare installations where a dupe message takes 3 seconds to appear, they could set the value to 3000ms, but I think 1000ms would take care of the majority of installations and wouldn't generally interfere with people who actually want to say turn a light on twice to make it do something different (they would just have to wait at least a second between the two times they turned it on).

Are there any Insteon devices that actually send two valid identical messages in short succession where it wouldn't hurt to drop one of them? Maybe dim/brighten? Of course you could exclude those special cases.

krkeegan commented 11 years ago

I would be willing to take a stab at it. As I mentioned duplicate messages are still sometimes the cause of some aldb corruption, so eliminating them has some real benefits.

My idea would be to make this really simple.

Place the filter way down in the stack, someplace just after the PLM message parser
Filter only insteon_received messages (both standard and extended)
I would create a hash wherein:
- The key is the message itself, without the message flags or the CRC byte (thanks chris)
- The value is the result of (the current milliseconds + ( hops_left * some_value)) wherein some_value is the number of milliseconds required for each hop (adjusted for standard or extended messages). We can dial this number in to account for phase shifts and whatever else.
As each message is received, a quick check is made to see if it exists in the hash. If it does, and the alloted time has not been exceeded, the message would be filtered.
Some garbage cleanup routine on the hash to dump old entries

As I think about it, my 3 second scenario is way off, it deals with a full round trip. We shouldn't see that much variation here as the duplication is happening in transit (at least we think it is). This should be less than a second.

I see chris beat me to it with pretty much the same design ideas.

dracoventions commented 11 years ago

On second thought, D14 CRC is based on Cmd1, Cmd2, and D1-13, so it shouldn't change as flags change and thus it should be identical in two duplicate messages.

I think you should also only ignore bits 2-3 (hops remaining) in the flags because a duplicate message would have the same max hops value, NAK/ACK, etc that the other bits describe.

krkeegan commented 11 years ago

At 60Hz there are 120 crossings in a second. Meaning each crossing is 8.333 milliseconds
A Standard Message Requires 6 Zero Crossings = 50 Milliseconds
An Extended Message Requires 13 Zero Crossings = 108.333 Milliseconds
Both of these include time for the rf transmission
This matches what is described in the Insteon spec.

We should add a small fudge factor for the PLM. Theoretically half a second should be more than enough to catch duplicate messages. Real world testing may vary.

Chris: Good call on the CRC. As to the other flags. I was going to be lazy and just ignore the other bits (ACK, Type...) since it just makes life easier (I can just drop a byte rather than get into bitwise calculations). In this short of time, what are the odds that the rest of the message would be otherwise identical?

\ As a side note, my two phase theory may no be correct, since messages are sent at zero crossings. But at this point I am really exceeding my knowledge.

dracoventions commented 11 years ago

Kevin, You can simply include ($flagsByte & 0xF3) which is a "logical and" that will include the value in $flagsByte with bits 2-3 always set to 0. I'm not sure offhand if a non-duplicate message could ever be sent that was identical in all ways except the flags byte, but it's so easy to include the flags byte minus bits 2-3, why take the risk?

I'm not sure either if the zero crossings on each of the two phases of the powerline are always identicaly timed. I think they're meant to be but in the real world maybe not in all installations..

krkeegan commented 11 years ago

OK, its done, and it seems to work:

30/04/2013 16:55:27  [Insteon_PLM] DEBUG3: Received PLM raw data: 02500d739c20f5f5212b20
30/04/2013 16:55:27  [Insteon_PLM] DEBUG4: Milliseconds 226.79
         PLM Command: (0250) insteon_received
            From Address: 0d:73:9c
              To Address: 20:f5:f5
           Message Flags: 21
                Message Type: (001) ACK of Direct Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 1
         Insteon Message: 2b20
                       Cmd 1: (2b) Peek One Byte(Deprecated)
                       Cmd 2: (20) Peeked Byte

30/04/2013 16:55:27  [Insteon::BaseInterface] DEBUG3: PLM command:insteon_received; Device command:peek; type:direct; group: 
30/04/2013 16:55:27  [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $s_la_lt_s1
30/04/2013 16:55:27  [Insteon::ALDB_i1] DEBUG3: $s_la_lt_s1 [0x0FCA] received: 20 for aldb_devhi
30/04/2013 16:55:27  [Insteon_PLM] DEBUG3: Received PLM raw data: 02500d739c20f5f5212b20
30/04/2013 16:55:27  [Insteon_PLM] DEBUG4: Milliseconds 283.29
         PLM Command: (0250) insteon_received
            From Address: 0d:73:9c
              To Address: 20:f5:f5
           Message Flags: 21
                Message Type: (001) ACK of Direct Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 1
         Insteon Message: 2b20
                       Cmd 1: (2b) Peek One Byte(Deprecated)
                       Cmd 2: (20) Peeked Byte

30/04/2013 16:55:27  [Insteon::BaseInterface] WARN! Dropped duplicate incoming message 0d739c20f5f5212b20

30/04/2013 16:55:34  [Insteon_PLM] DEBUG3: Received PLM raw data: 02500d11f420f5f521280f
30/04/2013 16:55:34  [Insteon_PLM] DEBUG4: Milliseconds 005.35
         PLM Command: (0250) insteon_received
            From Address: 0d:11:f4
              To Address: 20:f5:f5
           Message Flags: 21
                Message Type: (001) ACK of Direct Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 1
         Insteon Message: 280f
                       Cmd 1: (28) Set Address MSB(Deprecated)
                       Cmd 2: (0f) MSB

30/04/2013 16:55:34  [Insteon::BaseInterface] DEBUG3: PLM command:insteon_received; Device command:set_address_msb; type:direct; group: 
30/04/2013 16:55:34  [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $s_ch_lt_s1
30/04/2013 16:55:34  [Insteon_PLM] DEBUG3: Received PLM raw data: 02500d11f420f5f521280f
30/04/2013 16:55:34  [Insteon_PLM] DEBUG4: Milliseconds 061.56
         PLM Command: (0250) insteon_received
            From Address: 0d:11:f4
              To Address: 20:f5:f5
           Message Flags: 21
                Message Type: (001) ACK of Direct Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 1
         Insteon Message: 280f
                       Cmd 1: (28) Set Address MSB(Deprecated)
                       Cmd 2: (0f) MSB

30/04/2013 16:55:34  [Insteon::BaseInterface] WARN! Dropped duplicate incoming message 0d11f420f5f521280f

30/04/2013 16:55:46  [Insteon_PLM] DEBUG3: Received PLM raw data: 0250099bf420f5f5212b010250099bf4
30/04/2013 16:55:46  [Insteon_PLM] DEBUG4: Milliseconds 434.20
         PLM Command: (0250) insteon_received
            From Address: 09:9b:f4
              To Address: 20:f5:f5
           Message Flags: 21
                Message Type: (001) ACK of Direct Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 1
         Insteon Message: 2b01
                       Cmd 1: (2b) Peek One Byte(Deprecated)
                       Cmd 2: (01) Peeked Byte

30/04/2013 16:55:46  [Insteon::BaseInterface] DEBUG3: PLM command:insteon_received; Device command:peek; type:direct; group: 
30/04/2013 16:55:46  [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $s_st_lt_s1
30/04/2013 16:55:46  [Insteon::ALDB_i1] DEBUG3: $s_st_lt_s1 [0x0FF7] received: 01 for aldb_data3
30/04/2013 16:55:46  [Insteon::ALDB_i1] $s_st_lt_s1 accessing memory at location: 0x0FE8
30/04/2013 16:55:46  [Insteon_PLM] DEBUG3: Saving parsed data fragment: 0250099bf4
30/04/2013 16:55:46  [Insteon_PLM] DEBUG3: Prepending prior data fragment: 0250099bf4
30/04/2013 16:55:46  [Insteon_PLM] DEBUG3: Received PLM raw data: 0250099bf420f5f5212b01
30/04/2013 16:55:46  [Insteon_PLM] DEBUG4: Milliseconds 491.13
         PLM Command: (0250) insteon_received
            From Address: 09:9b:f4
              To Address: 20:f5:f5
           Message Flags: 21
                Message Type: (001) ACK of Direct Message
              Message Length: (0) Standard Length
                   Hops Left: 0
                    Max Hops: 1
         Insteon Message: 2b01
                       Cmd 1: (2b) Peek One Byte(Deprecated)
                       Cmd 2: (01) Peeked Byte

30/04/2013 16:55:46  [Insteon::BaseInterface] WARN! Dropped duplicate incoming message 099bf420f5f5212b01

I originally tried just hard coding 500 milliseconds as the "delay." However, that caught legitimate duplicate messages. For example my PLM id is 20.f5.f5, a number of legitimate instances of the second f5 were caught by this.

So it had to become a little more sophisticated, and now the delay is based on the number of hops left received, plus some extra wiggle room. So far I have no more false positives.

There is one small glitch with this design. If a messages becomes stacked up in the receiving queue, the received time is calculated from when this sub is run, not when the message is actually received by the PLM. This could permit some duplicates to sneak through.

I will push the commit in a second, Eloy in particular, please test it and let me know if it catches your KPL problem.

krkeegan commented 11 years ago

OK, the branch is:

https://github.com/krkeegan/misterhouse/tree/fix_issue_169

It was forked off of the merged_i2 branch. Eloy if you need me to make you a merged branch for you to test just let me know what needs to be in the merge.

peloy commented 11 years ago

Hi Kevin,

On 04/30/2013 08:59 PM, krkeegan wrote:

OK, its done, and it seems to work:

[...]

30/04/2013 16:55:46 [Insteon::BaseInterface] WARN! Dropped duplicate incoming message 099bf420f5f5212b01 |

I originally tried just hard coding 500 milliseconds as the "delay." However, that caught legitimate duplicate messages. For example my PLM id is 20.f5.f5, a number of legitimate instances of the second f5 were caught by this.

So it had to become a little more sophisticated, and now the delay is based on the number of hops left received, plus some extra wiggle room. So far I have no more false positives.

There is one small glitch with this design. If a messages becomes stacked up in the receiving queue, the received time is calculated from when this sub is run, not when the message is actually received by the PLM. This could permit some duplicates to sneak through.

I will push the commit in a second, Eloy in particular, please test it and let me know if it catches your KPL problem.

Wow, that was fast; thank you!

I am running with fix_issue_169, and so far I have not seen a device toggle more than once after pushing once a KPL button (before your fix for #169 I would push a button and the button would change state a couple of times in about one second). I have seen a few "WARN! Dropped duplicate incoming message" messages in the log, but nothing that I have noticed causing a negative effect.

I'll keep running this and report back if I notice any side effects. I'm also re-enabling the user code to toggle the garage door on KPL button press. I had disabled this code because of this problem.

Cheers,

Eloy Paris.-

peloy commented 11 years ago

Hi Kevin,

On 04/30/2013 09:06 PM, krkeegan wrote:

OK, the branch is:

https://github.com/krkeegan/misterhouse/tree/fix_issue_169

It was forked off of the merged_i2 branch. Eloy if you need me to make you a merged branch for you to test just let me know what needs to be in the merge.

I'm good; thanks! I just applied fix_issue_169 (plus fix_issue_17, fix_issue_83, and fix_issue_168) to Michael's merge_i2_aldb_support branch and that's what I am currently running.

Cheers,

Eloy Paris.-

krkeegan commented 11 years ago

This is still working fine for me. It has caught a small handful of legitimately duplicate messages in the past 6 days. It has had zero false positives. I have not notices any duplicate messages that have crept through, but they are harder to find.

Eloy, since you were having the issue, how is it working for you?

dracoventions commented 11 years ago

I started getting duplicate message problems either after I turned on the wireless thermostat or maybe as a result of switching to merge_i2_aldb_support. So I installed this fix but it didn't eliminate the duplicate messages. I noticed the 0xF3 bitmask wasn't being applied correctly so it was effectively setting the whole flags byte to 0, but I don't think that had to do with not detecting duplicates. I found an example of a standard message with 2 hops remaining followed by the same message with 0 hops remaining 250ms later which is longer than the 200ms timeout allowed. That might have to do with additional delays caused by running on Raspberry Pi - a slow mini computer.

After adding 500ms to the timeout and fixing the 0xF3 flags bitmask, I also found a case where I got a message with 2 max hops followed by the same message with 3 max hops, so I changed the flags bitmask to 0xF0 to ignore max hops as well. Here is the modified code:

    # $key will be set to $message_data with max hops and hops left set to 0
    my $key = $message_data;
    substr($key,12,2) = unpack("H2",chr(hex(substr($key,12,2)) & 0xF0));

    #How long does it take to transmit each hop of this message
    my $message_time = (length($message_data) > 18) ? 108 : 50;

    #Get hops left
    my $hops_left = hex(substr($message_data,13,1)) >> 2;

    #Add additional delay to catch PLM delays or out of sync messages
    my $delay = ($message_time * ($hops_left+2)) + 500;
    #Clean hash of outdated entries
    for (keys $$self{received_commands}){
        if ($$self{received_commands}{$_} < $curr_milli){
            delete($$self{received_commands}{$_});
        }
    }
    #Check if the message exists
    if (exists($$self{received_commands}{$key})){
        $is_duplicate = 1;
        ::print_log("[Insteon::BaseInterface] WARN! Dropped duplicate incoming message "
            . $message_data . " at $curr_milli") if $main::Debug{insteon};
    } else {
        #Message was not in hash, so add it
        main::print_log("Adding dupe message check key '$key' at $curr_milli + $delay = " . ($curr_milli + $delay) )
            if $main::Debug{insteon};
        $$self{received_commands}{$key} = $curr_milli + $delay;
    }
    return $is_duplicate;

Here's a log of duplicate message with hops left - 2 after 250 ms:

05/03/2013 06:03:48 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 025022ae3a000001cb1100
05/03/2013 06:03:48 PM Adding dupe message check key '22ae3a000001c31100' at 1367629428749 + 200 = 1367629428949
05/03/2013 06:03:48 PM [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops.
05/03/2013 06:03:48 PM [Insteon::BaseInterface] Received message from: $sl_WorkroomBack; command: on; type: alllink; group: 01
05/03/2013 06:03:48 PM [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $sl_WorkroomBack
05/03/2013 06:03:48 PM [Insteon::BaseObject] $sl_WorkroomBack::set(on, $sl_WorkroomBack)
05/03/2013 06:03:48 PM [Insteon_PLM] x10 sending code: B1 as insteon msg: e600
05/03/2013 06:03:48 PM [Insteon_PLM] x10 sending code: BK as insteon msg: e380
05/03/2013 06:03:48 PM [Insteon_PLM] Queueing message 'e380'.  Command stack size: 1, active message 'none'
05/03/2013 06:03:48 PM transmit_in_progress returning xmit_in_progress=0 || not _check_timeout('xmit')=1
05/03/2013 06:03:48 PM Aborting process_queue because transmit_in_progress
05/03/2013 06:03:48 PM  ***  Workroom back switch turned on when on, so toggle lamps,  $workroomLamps=0 (1=WorkroomSewingLamp, 2=WorkroomBackLamp)
05/03/2013 06:03:48 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 025022ae3a000001c31100
# This message received at 1367629429001 and duplicate copy received 250ms earlier at 1367629428749
05/03/2013 06:03:48 PM Adding dupe message check key '22ae3a000001c31100' at 1367629429001 + 100 = 1367629429101

And a log with max hops 2 then 3:

05/06/2013 05:31:37 PM [Insteon_PLM] x10 sending code: B1 as insteon msg: e600
05/06/2013 05:31:37 PM [Insteon_PLM] x10 sending code: BK as insteon msg: e380
05/06/2013 05:31:37 PM process_queue: Attempting to send 'e600'
05/06/2013 05:31:37 PM [Insteon_PLM] DEBUG2: Sending e600 incurred delay of 0.01 seconds; starting hop-count: ?
05/06/2013 05:31:37 PM [Insteon_PLM] DEBUG3: Sending  PLM raw data: 0263e600
05/06/2013 05:31:37 PM  ***  Workroom back switch turned on when on, so toggle lamps,  $workroomLamps=0 (1=WorkroomSewingLamp, 2=WorkroomBackLamp)
05/06/2013 05:31:37 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 0263e60006
05/06/2013 05:31:37 PM [Insteon_PLM] DEBUG3: Received PLM acknowledge: e600
05/06/2013 05:31:38 PM process_queue: Attempting to send 'e380'
05/06/2013 05:31:38 PM [Insteon_PLM] DEBUG2: Sending e380 incurred delay of 0.55 seconds; starting hop-count: ?
05/06/2013 05:31:38 PM [Insteon_PLM] DEBUG3: Sending  PLM raw data: 0263e380
05/06/2013 05:31:38 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 1515
05/06/2013 05:31:38 PM [Insteon_PLM] DEBUG3: Interface extremely busy. Resending command after delaying for 1 second
# Max hops 2 cleanup message:
05/06/2013 05:31:38 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 025022ae3a20f4a94b1101
05/06/2013 05:31:38 PM Adding dupe message check key '22ae3a20f4a9431101' at 1367886698682 + 700 = 1367886699382
05/06/2013 05:31:38 PM [Insteon::BaseInterface] DEBUG2: Message received with 2 hops left, delaying next transmit to avoid collisions with remaining hops.
05/06/2013 05:31:38 PM [Insteon::BaseInterface] Received message from: $sl_WorkroomBack; command: on; type: cleanup; group: 01
05/06/2013 05:31:38 PM [Insteon::BaseObject] DEBUG3: Adding hop count of 1 to hop_array of $sl_WorkroomBack
05/06/2013 05:31:38 PM [Insteon::BaseObject] Ignoring Received Direct AllLink Cleanup Message for $sl_WorkroomBack since AllLink Broadcast Message was Received.
05/06/2013 05:31:38 PM process_queue: Attempting to send 'e380'
05/06/2013 05:31:38 PM [Insteon::BaseMessage] WARN: now resending e380 after 1 attempts.
05/06/2013 05:31:38 PM [Insteon_PLM] DEBUG2: Sending e380 incurred delay of 1.09 seconds; starting hop-count: ?
05/06/2013 05:31:38 PM [Insteon_PLM] DEBUG3: Sending  PLM raw data: 0263e380
05/06/2013 05:31:39 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 0263e38006
05/06/2013 05:31:39 PM [Insteon_PLM] DEBUG3: Received PLM acknowledge: e380
05/06/2013 05:31:39 PM process_queue: No pending message.  Clear timer, return 0
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 025022ae3a000001c71100
05/06/2013 05:32:39 PM Adding dupe message check key '22ae3a000001c31100' at 1367886759099 + 650 = 1367886759749
05/06/2013 05:32:39 PM [Insteon::BaseInterface] DEBUG2: Message received with 1 hops left, delaying next transmit to avoid collisions with remaining hops.
05/06/2013 05:32:39 PM [Insteon::BaseInterface] Received message from: $sl_WorkroomBack; command: on; type: alllink; group: 01
05/06/2013 05:32:39 PM [Insteon::BaseObject] DEBUG3: Adding hop count of 2 to hop_array of $sl_WorkroomBack
05/06/2013 05:32:39 PM [Insteon::BaseObject] $sl_WorkroomBack::set(on, $sl_WorkroomBack)
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 025022ae3a000001cb1100
05/06/2013 05:32:39 PM [Insteon::BaseInterface] WARN! Dropped duplicate incoming message 22ae3a000001cb1100 at 1367886759231
05/06/2013 05:32:39 PM process_queue: No pending message.  Clear timer, return 0
05/06/2013 05:32:39 PM [Insteon_PLM] x10 sending code: B1 as insteon msg: e600
05/06/2013 05:32:39 PM [Insteon_PLM] x10 sending code: BJ as insteon msg: e280
05/06/2013 05:32:39 PM process_queue: Attempting to send 'e600'
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG2: Sending e600 incurred delay of 0.00 seconds; starting hop-count: ?
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Sending  PLM raw data: 0263e600
05/06/2013 05:32:39 PM  ***  Workroom back switch turned on when on, so toggle lamps,  $workroomLamps=1 (1=WorkroomSewingLamp, 2=WorkroomBackLamp)
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 0263e60006
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Received PLM acknowledge: e600
05/06/2013 05:32:39 PM process_queue: Attempting to send 'e280'
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG2: Sending e280 incurred delay of 0.55 seconds; starting hop-count: ?
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Sending  PLM raw data: 0263e280
# Got cleanup with max hops 2 above, then same cleanup with max hops 3 below
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Received PLM raw data: 025022ae3a20f4a94611010263e28006
05/06/2013 05:32:39 PM [Insteon_PLM] DEBUG3: Received PLM acknowledge: e280
05/06/2013 05:32:39 PM Adding dupe message check key '22ae3a20f4a9421101' at 1367886759959 + 650 = 1367886760609
05/06/2013 05:32:39 PM [Insteon::BaseInterface] DEBUG2: Message received with 1 hops left, delaying next transmit to avoid collisions with remaining hops.
05/06/2013 05:32:39 PM [Insteon::BaseInterface] Received message from: $sl_WorkroomBack; command: on; type: cleanup; group: 01

peloy commented 11 years ago

On 05/06/2013 09:44 PM, cdragon wrote:

I started getting duplicate message problems either after I turned on the wireless thermostat or maybe as a result of switching to merge_i2_aldb_support.

Heh, sorry to hear that :-(

Were the duplicate messages causing you some trouble? In my case they were because they were causing the relay that toggles my garage door to toggle twice per button press, instead of once.

So I installed this fix but it didn't eliminate the duplicate messages. I noticed the 0xF3 bitmask wasn't being applied correctly so it was effectively setting the whole flags byte to 0, but I don't think that had to do with not detecting duplicates. I found an example of a standard message with 2 hops remaining followed by the same message with 0 hops remaining 250ms later which is longer than the 200ms timeout allowed. That might have to do with additional delays caused by running on Raspberry Pi - a slow mini computer.

After adding 500ms to the timeout and fixing the 0xF3 flags bitmask, I also found a case where I got a message with 2 max hops followed by the same message with 3 max hops, so I changed the flags bitmask to 0xF0 to ignore max hops as well. Here is the modified code:

[...]

Did the modified code improve things? I am running with Kevin's original fix, with no modification, and I did notice yesterday that one button press on the KeypadLinc cause the garage door relay to toggle twice, which means that MH processed a duplicate message. I forgot to look at the log file at the time... is there anything I should capture next time that this happens?

Cheers,

Eloy Paris.-

peloy commented 11 years ago

Hi Kevin,

On 05/06/2013 07:16 PM, krkeegan wrote:

This is still working fine for me. It has caught a small handful of legitimately duplicate messages in the past 6 days. It has had zero false positives. I have not notices any duplicate messages that have crept through, but they are harder to find.

Eloy, since you were having the issue, how is it working for you?

It is working pretty well. I noticed yesterday that one duplicate made it in, which caused my garage door relay to toggle twice in response to a single KPL button press. But this seems to occur much less frequently after I applied your fix for this issue.

Is there anything I should capture when I notice what looks like a duplicate?

Cheers,

Eloy Paris.-

dracoventions commented 11 years ago

Eloy,

No problems so far with the above fix. Kevin said originally that he tried a 500ms delay and it was causing legitimate duplicate messages to be discarded, so my fix may cause that problem as well. I didn't understand what kind of legitimate messages should be sent twice in under 500ms though. I'm still worried about bright/dim commands but I don't have any dimming devices to test that on.

Your occasional problems are likely due to what I was seeing with a duplicate message appearing in 250ms which the original timeout was not large enough to catch.

I have some code I have that toggles an X10 relay when you press On on an insteon fixture that's already on. Without duplicate message filtering, the X10 relay is told to toggle on, then off, then on, etc for each duplicate insteon "on" message received.

peloy commented 11 years ago

On 05/07/2013 12:11 AM, cdragon wrote:

Eloy,

No problems so far with the above fix. Kevin said originally that he tried a 500ms delay and it was causing legitimate duplicate messages to be discarded, so my fix may cause that problem as well. I didn't understand what kind of legitimate messages should be sent twice in under 500ms though. I'm still worried about bright/dim commands but I don't have any dimming devices to test that on.

Your occasional problems are likely due to what I was seeing with a duplicate message appearing in 250ms which the original timeout was not large enough to catch.

Okay, cool. Let's see what Kevin things about your modifications.

I have some code I have that toggles an X10 relay when you press On on an insteon fixture that's already on. Without duplicate message filtering, the X10 relay is told to toggle on, then off, then on, etc for each duplicate insteon "on" message received.

Yup, that's the exact same problem I was having/am having very occasionally now. If you press a button and that button sends an on or off message that gets interpreted by the receiving devices as on or off then it does not matter if you have duplicates, but when the "on" message causes the receiving device to toggle something, like your case and mine, then duplicates are obviously a problem.

Cheers,

Eloy Paris.-

krkeegan commented 11 years ago

I am fine with zeroing out the max hops as well. There are likely instances in which the PLM ACK does not reach a device in time causing a retry with an increased hop count. I think it is reasonable that we suppress those duplicates as well.

The increased "delay" causes major issues for me though, depending on the message the total allowed delay is upwards of 800 milliseconds now. When I ran scan all link tables it gobbled up the following legitimate messages:

(15 instances) of a standard length peek message. These messages are cmd1=2B with cmd2=(the contents of that byte). The address of my PLM is 20.F5.F5, the majority of the errors were caused when the mid and low byte of my PLM were sent in consecutive peek messages. But I also had a number of instances in which data2 and data3 in a ALDB record were both 00, and were erroneously caught. Finally I had one instance in which the group number and the hi byte of the device were the same number and were erroneously caught.
(2 instances) of extended length ALDB read request ACKs. ACKs for extended read write ALDB are sent as standard messages from the device. They are all identical cmd1=2F and cmd2=00. Even with an extended message sent by the device in between, the amount of the delay was still long enough to allow two false positives.

But the worst case was when I factory reset a device and rewrote all of the links. The device had 8 links, and it caused 7 false positives. In this case the PLM sends an extended length command with the full ALDB record, but the device only responds back with a standard length 2F00 command and nothing else. Since the ACK is identical every time it was caught as a duplicate every time.

A few months back I was dealing with corruption in the ALDB records, and I have to say that is the worst possible scenario in MH. You can get caught in an infinite delete_orphans sync_links loop and never get a clean record of the links. So I would rather risk a duplicate message slipping through, rather than create the potential for ALDB corruption.

I think the best solution is to add in a user configurable setting in private.mh.ini to adjust the additional "padding" beyond the default hop_count message_time calculation.

dracoventions commented 11 years ago

I think the best solution would be to not include certain messages in the duplicate filter. Or even better might be to ONLY INCLUDE certain messages in the filter - namely "command: on or off; type: cleanup or alllink;". If other duplicate messages are found to cause problems they can always be added later.

krkeegan commented 11 years ago

Oh, good call. Only using the filter on a defined list of message types would work well.

krkeegan commented 11 years ago

OK, I pushed an update to the branch that achieves what Chris mentioned above.

However, I am not totally happy with the result. The code as is is a very low level process but it is being limited to a rather high level function of preventing multiple set calls in a short period of time.

I think the better option is to split these two features. The is_duplicate process is still valuable if the duplicate window is cut back to the insteon spec. MH already waits to send a message if it sees an incoming message with hops_left, it makes sense that it should also ignore identical messages in this same time frame.

Then to solve Eloy's issue, I will add a check in the set routine to prevent re-setting a device to the same state in less than 1 second.

peloy commented 11 years ago

Hi Kevin,

On 05/07/2013 09:49 PM, krkeegan wrote:

OK, I pushed an update to the branch that achieves what Chris mentioned above.

However, I am not totally happy with the result. The code as is is a very low level process but it is being limited to a rather high level function of preventing multiple set calls in a short period of time.

I think the better option is to split these two features. The is_duplicate process is still valuable if the duplicate window is cut back to the insteon spec. MH already waits to send a message if it sees an incoming message with hops_left, it makes sense that it should also ignore identical messages in this same time frame.

Then to solve Eloy's issue, I will add a check in the set routine to prevent re-setting a device to the same state in less than 1 second.

Should I wait until you add this check or you'd like me to upgrade to the current code in the branch? Things are working well for me right now.

Cheers,

Eloy Paris.-

krkeegan commented 11 years ago

It duplicated chris's changes, so there shouldn't be much need to testing at the moment.

On Tue, May 7, 2013 at 7:08 PM, Eloy Paris notifications@github.com wrote:

Hi Kevin,

On 05/07/2013 09:49 PM, krkeegan wrote:

OK, I pushed an update to the branch that achieves what Chris mentioned above.

However, I am not totally happy with the result. The code as is is a very low level process but it is being limited to a rather high level function of preventing multiple set calls in a short period of time.

I think the better option is to split these two features. The is_duplicate process is still valuable if the duplicate window is cut back to the insteon spec. MH already waits to send a message if it sees an incoming message with hops_left, it makes sense that it should also ignore identical messages in this same time frame.

Then to solve Eloy's issue, I will add a check in the set routine to prevent re-setting a device to the same state in less than 1 second.

Should I wait until you add this check or you'd like me to upgrade to the current code in the branch? Things are working well for me right now.

Cheers,

Eloy Paris.-

— Reply to this email directly or view it on GitHubhttps://github.com/hollie/misterhouse/issues/169#issuecomment-17582650 .

krkeegan commented 11 years ago

OK, I made some extensive changes.

I forked the functions as I discussed before. So now there is a check in BaseObject that prevents a device from being set to the same state more frequently than once a second.

I reworked the message window calculation in is_duplicate. I tried to have some basis in the specification for the design. Based on the design the message window is much longer for messages that require an ACK such as a set command from a device. But the message window is much shorter for incoming ACKs similar to what comes in during an ALDB scan.

I have used it quite a bit tonight and it seems to be working well. It has caught a number of legitimately duplicative messages without any false positives. It has been working so well that it caught a lot more duplicative messages in the ALDB scan than I realized were occurring. It also caught the duplicate set commands from the device without needing to rely on the fallback code which stops duplicate set commands in 1 second.

The fix_issue_169 tree has all of the updates.

peloy commented 11 years ago

Hi Kevin,

On 05/09/2013 12:27 AM, krkeegan wrote:

OK, I made some extensive changes.

I forked the functions as I discussed before. So now there is a check in BaseObject that prevents a device from being set to the same state more frequently than once a second.

I reworked the message window calculation in is_duplicate. I tried to have some basis in the specification for the design. Based on the design the message window is much longer for messages that require an ACK such as a set command from a device. But the message window is much shorter for incoming ACKs similar to what comes in during an ALDB scan.

I have used it quite a bit tonight and it seems to be working well. It has caught a number of legitimately duplicative messages without any false positives. It has been working so well that it caught a lot more duplicative messages in the ALDB scan than I realized were occurring. It also caught the duplicate set commands from the device without needing to rely on the fallback code which stops duplicate set commands in 1 second.

The fix_issue_169 tree has all of the updates.

Thanks for the improvements. I'm now running this latest code to help with testing. I'll report any issues.

Cheers,

Eloy Paris.-

pork13 commented 11 years ago

Likewise. Thanks!

krkeegan commented 11 years ago

This is still working well for me. So far it has only had one false negative which was the ACK of a status request, not a big deal that it was missed.

I have not had any false positives.

Nor have I had any instances in which the set within 1 second logic was needed.

dracoventions commented 11 years ago

I've been running this the last couple days and haven't seen any problems so far. I only had time to do a couple simple tests as I'm going out of town tomorrow. None of the regular things I use have malfunctioned and I did a couple of ALDB scans successfully.

krkeegan commented 11 years ago

Eloy's recent work uncovered a bug that has been duplicated throughout the MH code and which I duplicated here.

The bug is in the code I added to BaseObject which prevents the calling of set_receive more than once per second. I allow set_receive to be called more frequently if the state set does not match the current state of the device. This doesn't work properly when multiple set calls are made in the same loop.

To fix this I just need to add a check to state_final.