Closed mguaylam closed 3 years ago
@dmulcahey I also have seen the same issue, although only when routing through various repeaters. When the lock connects directly to the coordinator (NORTEK stick), it works perfectly, but when going through a repeater (in my case, SmartThings smart plugs and/or Ikea signal repeaters), the message fails getting to the lock.
I can provide a Wireshark dump as well if needed (in private). 🙂
In the log there was only one delivery failure for the log out of 16 attempts. And judging by the time it took - 5 seconds, apparently the failure came from the network. I don't really know what could be done here, ZHA sends the requests, stick sends out the request, if it fails, then if fails.
2021-01-07 20:16:43 DEBUG (MainThread) [bellows.ezsp.protocol] Send command sendUnicast: (<EmberOutgoingMessageType.
OUTGOING_DIRECT: 0>, 0xBB33, EmberApsFrame(profileId=260, clusterId=257, sourceEndpoint=1, destinationEndpoint=1, op
tions=<EmberApsOption.APS_OPTION_ENABLE_ROUTE_DISCOVERY|APS_OPTION_RETRY: 320>, groupId=0, sequence=244), 245, b'\x0
1\xf4\x01')
2021-01-07 20:16:43 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame 52 (sendUnicast) received: b'00a4'
2021-01-07 20:16:48 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame 63 (messageSentHandler) received: b
'0033bb04010101010140010000a4f56600'
2021-01-07 20:16:48 DEBUG (MainThread) [bellows.zigbee.application] Received messageSentHandler frame with [<EmberOu
tgoingMessageType.OUTGOING_DIRECT: 0>, 47923, EmberApsFrame(profileId=260, clusterId=257, sourceEndpoint=1, destinat
ionEndpoint=1, options=<EmberApsOption.APS_OPTION_ENABLE_ROUTE_DISCOVERY|APS_OPTION_RETRY: 320>, groupId=0, sequence
=164), 245, <EmberStatus.DELIVERY_FAILED: 102>, b'']
2021-01-07 20:16:48 DEBUG (MainThread) [zigpy.device] [0xbb33] Delivery error for seq # 0xf4, on endpoint id 1 clust
er 0x0101: message send failure
I can provide a Wireshark dump as well if needed (in private). slightly_smiling_face
@mguaylambert can you compare the difference in the failed requests and the one which worked? Were do they fail? do they reach the parent device? Does the lock polls for that message?
I feel like zha does everything possible: it does extend the timeout and waits for the reply for 30s, but it fails much faster than that, meaning network knows there was a failure.
BTW, for nortek, try enabling the source routing and check if it makes a difference. To enable, add the following in configuration.yaml
zha:
zigpy_config:
source_routing: true
ezsp_config:
CONFIG_SOURCE_ROUTE_TABLE_SIZE: 32
Thank you so much for looking into this @Adminiuga !
I did add the source_routing to my configuration prior to analyzing the network. After a couple days of testing, it doesn't seem to change the success / failure rate.
After observing the behaviour of the network, I have gathered 3 examples. Two of them failing, one of them working. The difference between them to me look to be : when the stick sends two unlock commands back to back, the lock seems to suffer from an internal failure (it says the unlock command was successful but didn’t unlock). When ZHA sends only one, everything works.
You can see from this file, Failure-1.txt , ZCL Door Lock: Unlock Door, Seq: 102
got sent twice on the network and the lock ended-up receiving it twice as a consequence. The lock correctly polls the data. Also, the lock finishes by sending a ZCL Door Lock: Unlock Door Response, Seq: 102
twice but fails to unlock from it. This is why it does not report any unlock attribute since it physically didn’t unlock.
Here is the log from Home-Assistant at the same time frame : Failure-1.log
Here we can observe the same pattern as Failure-1 : Failure-2.txt With its associated Home-Assistant log : Failure-2.log
Here things are different (and they succeeded). ZCL Door Lock: Unlock Door, Seq: 203
is sent only once over the network, the lock polls the data and the lock physically unlock. We can also see that it reports its new attributes : ZCL: Report Attributes, Seq: 132
(unlocked) to HA. One frame is missing from my capture (ZCL: Report Attributes, Seq: 132
) (Success-1.txt) but everything suggests that it did reach the coordinator.
And here’s the associated log file : Success-1.log
So, I am wondering if the lock is not faulty here (or the module) since I guess receiving twice the same command should not change much in the process (should the stick sends twice the same command?).. Also, unfortunately, it does not explain yet why sometimes I see some message send failure (I’ve yet to see those which is surprising tbh.).
Thank you for any advice and your valuable time!
@Adminiuga How is it with the application layer ack ? If not getting one ack the application should resending the same message if its being done with unicast but i cant see how its being dont in the logs :-(
Yeah, there's definitely some strangeness going on in the capture, but I can't quite make it out, without expanding the network layer and frankly I don't dig that deep often, so a bit fuzzy on this.
But it does seem like there way to many commands sent, some duplication is normal when the sniffer hears all the retransmissions relayed messages, but then, when the Unlock door response is being sent, there way too many APS acks bouncing around, aren't there?
What firmware is running on Ikea? Is it updated to the latest ZB3 firmware?
@Adminiuga : Yeah actually in all 3 cases (both failures and successes), we get DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xBB33:1:0x0101]: executed 'unlock_door' command with args: '()' kwargs: '{}' result: [<Status.SUCCESS: 0>]
and that’s because the lock seems to lie about it. Making me wonder if the problem is not partially with the lock itself. I am thinking about reaching Yale about it and show them that the lock says unlock_door = success
without really doing it when it receives 2 unlock commands back to back. One thing I wonder is, why does the coordinator send twice : ZCL Door Lock: Unlock Door
in the failed examples. Is it normal? And who’s responsible for doing that, ZigPy or the coordinator? Both are the same sequence. It’s almost like the coordinator did not catch the network ACK (but I did, which is weird because I'm very far from them) and it decided to re transmit it. And yet, shouldn't the Philips Hue bulb should discard it since it received the same message twice?
I am no expert in ZigBee and pardon me for that, I am learning steeply how it all works but I wonder, would prevent the [stick or ZigPy?] from transmitting twice ZCL Door Lock: Unlock Door, Seq: 252
solve the issue? Or should I look with Yale why their lock is behaving that way?
As for the APS: Ack, Dst Endpt: 1, Src Endpt: 1
it looks good to me considering that there is a bulb relaying and since the stick sent ZCL Door Lock: Unlock Door, Seq: 252
twice, it is just replying to both commands. You can make the difference who is sending or relaying by looking at the IEEE 802.15.4 Data, Dst: 0x88cc, Src: 0x0000
(coordinator to bulbs) and IEEE 802.15.4 Data, Dst: 0xbb33, Src: 0x88cc
(bulbs to lock).
As for firmwares, all my devices as of today, they look up to date. The IKEA bulbs are on 2.0.029
which I believe is ZigBee 3.0 and the Philips Hue on which the lock is a child of is is on 1.65.11_hB798F2B
which I believe is the latest firmware (I update them via Bluetooth).
I can always send you the Wireshark captures if you want (filtered) but that would be in private since I need to give you my security key too.
@Adminiuga : Yeah actually in all 3 cases (both failures and successes), we get DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xBB33:1:0x0101]: executed 'unlock_door' command with args: '()' kwargs: '{}' result: [<Status.SUCCESS: 0>] and that’s because the lock seems to lie about it. Making me wonder if the problem is not partially with the lock itself. I am thinking about reaching Yale about it and show them that the lock says unlock_door = success
You do get:
ZigBee Cluster Library Frame
Frame Control Field: Cluster-specific (0x19)
Sequence Number: 102
Command: Unlock Door Response (0x01)
Payload
Lock Status: Success (0x00)
Door lock response from the lock and it success. If lock reports it, but doesn't do anything, then it could be the problem with the lock.
Have you tried replacing the fresh batteries and checking that lock is not "binding" physically, i.e. it tries to lock/unlock but can't move physically?
If not finding anything else in the logs try changing the locks parent so its not using the Philips HUE bulb like putting one IKEA bulb or plug in the near and from the routers card "adding devices via this device" and resetting the lock for getting it rejoin the network. I can being that the philips bulb is not releasing the lock then power it down.
Sniff logs its great than all information is there but not nice sharing the network key on the open forums :-((
@Adminiuga : Exactly. The lock sends twice : Lock Status: Success (0x00)
where it actually does nothing physically, not even tries to do anything actually (no sound, no movement). I’ve inspected it, the lock can move freely without any friction. Also, I would have expected the lock to report an attribute of being blocked physically if it was the case but I did not see anything about it in the capture.
I will try a fresh set of batteries to see but the ones in it we’re the ones that came with it and the lock is fairly new. Still reporting 80% of battery.
@MattWestb : that’s the strangest part. When the lock is a child of the coordinator, it works perfectly all the time. Where when it is a child of a router (I tried both Philips Hue, IKEA bulbs & IKEA smart outlet) it fails fairly often. I used to just force the lock to go on the coordinator but it is not really realistic as it change it’s parent from time to time since there is closer devices to it.
And yeah, sharing the network key is not something I should do publicly indeed. One thing I could do is either output all the layers in the text file (heavy reading) or put it in a JSON.
I have reading of locks that dont like being on on router and only working then being direct connected to the coordinator bu i was not reading how it was working or not. My Philips HUE bulb is not deleting childs then they is changing router or leaving the network but its for the moment not in your case then you can see both received and transmitted messages from the lock.
The best if Adminiuga can taking one deep dive in the log perhaps hi can see somthing but its sounds little like the lock have some "spontaneous free days" then its like and you not.
Then looking on the map is the lock having very good signal to the Philips bulb.
@Adminiuga can it being missing ack in the application layer like the switch that you was merged earlier today ? The lock dont have any quirk but if its "good behaving".
can it being missing ack
No. you don't send a response to a response.
True and the lock have replayed withLock Status: Success (0x00)
:-((
@Adminiuga I was able to recreate a message send failure
as well to conclude my findings but here’s the catch :
I see [0xbb33] Extending timeout for 0x57 request
in the Home Assistant logs but when looking at the frames
, it took a little more time than usual but still only 6 seconds before the coordinator got a confirmation from the lock that it has received the command (The bulb was waiting for a data request from the lock and I guess the lock was waiting for the channel to be free.) The network was able to deliver every frame correctly and the lock got unlocked just fine.
I can see the coordinator sent 3 unlock commands initially (in less than 3 seconds) and got all passed to the bulb correctly meaning the two other commands were really unnecessary here. Why is the coordinator doing this? And why does it tell ZigPy that a failure happened (most of the time, even if it says there was a network failure, everything is fine).
ACK
on the MAC link and ZCL Door Lock: Lock Door Response = success
on the application level but fails to physically do anything.
data request
from the lock to send anything. And does not spit many commands at once thereafter since it has direct feedback.message send failure
’s I get from time to time doesn’t look rights as the lock still does unlock (outside the bug it has) and there is no miscommunication on the network.It is clear to me the lock has a software issue. Considering this, I will reach Yale about it. Meanwhile including the previous questions, can we somehow mitigate the number of commands the coordinator spits on the network and why does the coordinator report a message send failure
when the frames look all good? Does it mean it was not capable to place the frame on the network?
Thank you for your time and help, it is really appreciated.
The problem
When my Yale YRD226 TSDB connect thru a router (Philips Hue or IKEA Tradfri bulbs) instead of the coordinator directly, some lock / unlock commands fails. This issue does not happen when the door lock is directly connected to the coordinator.
The issue has been replicated with both IKEA Tradfri White Spectrum and Philips Hue White ambiance bulbs (tried the Philips Hue in the hope it would solve the problem)
The problem is intermitent. It works for a while and other times it get stucked for a short while.
Thank you in advance for your great work and help!
Environment
Problem-relevant
device signature
Problem-relevant
configuration.yaml
Traceback/Error logs
Full log : HALog.zip
Additional information