home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.55k stars 30.73k forks source link

Fast battery drain on Yale YRD226 door lock with ZHA #49355

Closed mguaylam closed 3 years ago

mguaylam commented 3 years ago

The problem

I am seeing a fast battery drain on my door lock since ive updated to 2021.4.X (or at least, from what I remember). To better understand what is happening I did a capture of my ZigBee network and ive seen a strange behaviour : the coordinator keeps sending ZCLs Poll Control : Set Long Poll Interval to the door lock. It does this for a while after having locked or unclocked the door lock but I am still not sure about the patern.

Usually I would see something like 0.2% battery drain everyday but right now it’s as high as 1%.

ZigBee coordinator : HUSBZB-1

Door lock signature : { "node_descriptor": "NodeDescriptor(byte1=2, byte2=64, mac_capability_flags=128, manufacturer_code=4125, maximum_buffer_size=82, maximum_incoming_transfer_size=255, server_mask=0, maximum_outgoing_transfer_size=255, descriptor_capability_field=0, allocate_address=True, complex_descriptor_available=False, is_alternate_pan_coordinator=False, is_coordinator=False, is_end_device=True, is_full_function_device=False, is_mains_powered=False, is_receiver_on_when_idle=False, is_router=False, is_security_capable=False, is_valid=True, logical_type=<LogicalType.EndDevice: 2>, *user_descriptor_available=False)", "endpoints": { "1": { "profile_id": 260, "device_type": "0x000a", "in_clusters": [ "0x0000", "0x0001", "0x0003", "0x0009", "0x000a", "0x0020", "0x0101", "0x0b05" ], "out_clusters": [ "0x000a", "0x0019" ] } }, "manufacturer": "Yale", "model": "YRD226 TSDB", "class": "zigpy.device.Device" }

image

What is version of Home Assistant Core has the issue?

core-2021.4.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

ZHA

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha

Example YAML snippet

zha:
  zigpy_config:
    ota:
      ikea_provider: true
    source_routing: true
    ezsp_config:
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 32

Anything in the logs that might be useful for us?

home-assistant.log

Additional information

No response

probot-home-assistant[bot] commented 3 years ago

Hey there @dmulcahey, @adminiuga, mind taking a look at this issue as its been labeled with an integration (zha) you are listed as a codeowner for? Thanks! (message by CodeOwnersMention)

MattWestb commented 3 years ago

I have not looking what is making the coordinator do this but the normal is is the device have one pull control cluster that is working normal ZHA is setting up pull control in the sleeping end device. The end device is then every 55 minutes doing on check in that is sent to the coordinator and its responding to it. If have somthing that need being sent its setting short pull and sending the information to the device and then sending long pull or time out the short pull. The "24" in long pull interval is 6 seconds (24+1/4 second) its very intensive but the devs like having like that in general i have getting they disabling it for IKEA controllers that is now using the (IKEA) default 1160 (4.88 minutes) but i dont knowing if its needlessly for this device or not then 4.88 seconds is making it behaving as being online all the time but can taking up to 4.88 seconds before the command is arriving to the device.

If you like testing you can patching the "global" long pull time that we have doing for testing with IKEA remotes for see how its working with your device (its described in the discussion of configuring IKEA remotes).

If you like see all the stages one transaction is doing in WS you can using one more defined filter in WS like that:

(wpan.src64 == 00:0d:6f:ff:fe:51:43:5c) || (wpan.dst64 == 00:0d:6f:ff:fe:51:43:5c) || (wpan.src16 == 0xd350) || (wpan.dst16 == 0xd350) || (zbee_nwk.src == 0xd350) || (zbee_nwk.dst == 0xd350)

(your zbee_nwk.addr is nicer !!, The long and short address is for one of my IKEA remotes)

Then you is seeing all IEEE and short address in the 802.15.4 layer and also zigbee short addresses that being sent like if your end device is using more routers for sending / receiving messages but you is not seeing the 802.15.4 ack that all devices shall sending then they is not gaving and short or long addresses and no PAN-ID only sequence numbers. I have adding the IEEE short addresses as columns so i can see all the jumps the message is doing from the device on the road to its destination. WS-AttrR One remote is reporting 3 attributes to its parent and the parent is sending it to the coordinator and with all acks in every hope the message is doing with all sequence numbers in the GUI :-))) One thing with attribute reporting from sleeping end devics is using passive ack = the device is sending the attribute to its parent and getting one IEEE ack and then its listening and if the parent is reseeding it to one neighbor = its the same its have arrived to its destination and dont one Zigbee ack.

Adminiuga commented 3 years ago

There were no changes in poll control configuration for yale locks in the new version. The set long poll interval command is send for each poll check in received from the device. What you see in the packet capture, is actually a single check-in & set long poll interval transactions, you could see that by the sequence number for each command & default response.

How many check ins do you have in an hour?

mguaylam commented 3 years ago

Thank you for the clarification. Look like you are right. It seem’s the lock is checking in every 2 seconds so that would mean 1800 times per hour. Should I.. remove the batteries and put them back in ? 😅 @Adminiuga

image

Thank you as well @MattWestb for your guidance. Always nice to hear from you! I was indeed reading that it was quarter seconds but wasn't sure so thank you for confirming that! :) BTW, what do you mean by my 'zbee_nwk.addr is nicer' ? :)

MattWestb commented 3 years ago

You was using one filter with "zbee_nwk.addr eq 0x9486" and i have doing it like "(zbee_nwk.src == 0xd350) || (zbee_nwk.dst == 0xd350)" because i was not knowing the shorter form.

For you lock try doing one "reconfigure" and sniff if its start using long pull 24 = 6 seconds and is not do one reset of the device and open the network for joining so it can rejoining.

But also the question is shall it pulling its parent so oft so it can receiving commands from the coordinator ? If making long pull longer its taking long time for commands being received then ZHA is sending some thing to it.

IKEA remotes dont need receiving command then they is made for only sending commands then some is pressing one button.

MattWestb commented 3 years ago

By the way i think you dont have any quirk for your lock: "class": "zigpy.device.Device"

But the basic function shall working normal without it as the pull control.

mguaylam commented 3 years ago

@MattWestb just tried the reconfigure option but that did not change the behaviour. You are right that there is no quirk for this lock. I know some of them need to double the power cluster but not the YRD226.

Adminiuga commented 3 years ago

Yes, try resetting the lock and maybe rejoining it without removing it from zha. For the lock to check in every 2s is not normal and I'm not sure what causes it. Polling cluster configuration was introduced long before 2021.3 version , so there shouldn't be changes in the behavior.

mguaylam commented 3 years ago

@Adminiuga It still does it unfortunately after going thru the reset process. What’s weird is that it does it after about an hour (after restarting Home Assistant) and then pretty much wont stop from there.

I’m checking the various attributes the coordinator sends in. Those look good right?

Check-in interval ⬅

image

Check-in ➡

image

Short poll ⬅

image

Long poll ⬅

image

I do have the same module twice. I could try the other but not right now because my neighbour in the building is going to kill me for this door lock that always say : " Welcome to the secure world! " when I replace the batteries.

I guess the module is faulty? If so, I don’t see any way to reset it.

Aside, do you own a ZigBee lock? If so, which one do you recommend?

MattWestb commented 3 years ago

The check in is sett to 55 minutes (13200 / 60 / 4 (1/4 second = 55 min). The fast pull is set with 8 seconds timeout = OK Is you getting one ack / reprocess of the setting the long pull (SQNr 241) ?

Ii think deCONZ was having problems with smatthings was not going back to long pull because it was not OK configured and failing back to fast pull and burning batteries.

But if you is getting on ack / response it shall being OK.

Trying reading and setting the long pull interval with cluster commands and see wot the device is having in its conffig.

mguaylam commented 3 years ago

Tried this, but unfortunately after about an hour it came back to the previous issue. Weird enough, it did reset the door lock but the module still was on the network, operable. image

@MattWestb yep look like it got accepted : image

mguaylam commented 3 years ago

Could I set the cluster 0x0004 (Check-inIntervalMin) from 0x0020 (PollControl) to perhaps something like 6600 on the server? Would that maybe prevent the lock from killing itself?

MattWestb commented 3 years ago

Long pull interval (id: 0x0001) you can read and also writing. The normal is 24 (1/4 seconds)= 6 seconds and for IKEA controlling devices 1160 = 4.88 minutes.

ZHA is setting the long pull to 24 after one checking from the device that happens every 55 minutes.

Try setting it to somthing higher and sniffing what is doing.

Also then you is pairing the device its requesting on "end device time out" and the router is acking it. In the request is one time and you shall not setting the long pull time longer then that (long pull time out must being shorter then it) or your device is rejoining the network then it like pulling its parent.

MattWestb commented 3 years ago

Try changing: checkin_interval (id: 0x0000) short_poll_interval (id: 0x0002)

in small steps and look if the device is changing the pulling of its parent.

Adminiuga commented 3 years ago

Add your lock's manufacturer Id (can be obtained from zigbee signature from the node descriptor) to https://github.com/home-assistant/core/blob/99c5087c1e6868a3d7e819e1084728c5dcc558ac/homeassistant/components/zha/core/channels/general.py#L395

MattWestb commented 3 years ago

Also possible deleting the pull control cluster in the device quirk (if it have one) but the "Adminiuga patch" is working well but its making all devices from the manufacture.

mguaylam commented 3 years ago

Yeah I was considering to add it to the list of ignored manufacturer for polling in last resort.

Weirdly I was able to make it stop with a fast_poll_stop (0x0001) command. image image

After looking at it there is a weird loop going on where the coordinator is asking to extend the fast polling by 2 seconds at each check-in making it almost never ending. Shouldn't it responds only with the long poll interval and not short poll AND long poll?

image

I’m testing fast_poll_stop and it works every time until it need’s to check in normally.

Adminiuga commented 3 years ago

Coordinator is only replying to the CheckIn command sent by the lock. The reply to this command says: do fast polling for 2s. After that it send the set checkin interval to 55min.

After two seconds, device should stop fast polling, but instead it sends another check-in command and the cycle repeats.

Can you downgrade the ha and retry?

MattWestb commented 3 years ago

What is the "checkin_interval (id: 0x0000)" in the device ? ZHA have setting it to 13200 = 55 min in ZB3 end sleeping devices. Its looks your device is having some strange config or is one ack/response missing that is making it looping.

mguaylam commented 3 years ago

@MattWestb oh that’s weird, the lock says : 14400 in checkin_interval (0x0000) I was expecting 13200 indeed. Trying to set it to 13200 seems to break it as it returns none now and the behaviour did not change. I did go to HA 2021.3.0 and reconfigure the device and now it says 13200.

@Adminiuga clearly something changed in the ZigBee module as this behaviour was not there as far as I remember so I can’t blame ZHA for this. But when reading the ZCL specs, I'm confuse on what is supposed to really happen. When I read the ZCL specification, it says that

If the Poll Control Cluster Client indicates that it would like the device to go into a fast poll mode, it is responsible for telling the device to stop fast polling when it is done sending messages to the fast polling device.

image

So during the fast poll mode, it is supposed to only make data requests to it’s parent and not new check-in’s like we saw earlier? But at the same time, looking at this post, it suggest that the client should always end the fast poll window with an exit fast poll mode check-in response. So I am to wonder if the server is waiting for a fast poll exit but since it is not receiving it, it relaunch the check-in when the previous one expired. And when I send fast poll stop commands, it seems to exit it successfully.

image

I guess the intent here is that if the client miss transmitting everything when the fast poll window presents itself, the server tries to give it some more time until it receive the fast poll exit?

Many thanks for your time on this project BTW, I really wish I could give you guys a monthly donation for the hard work on this! 🙂

Adminiuga commented 3 years ago

yeah, you are right. per specs

If the Poll Control Cluster Client indicates that it would like the device to go into a fast poll mode, it is responsible for telling the device to stop fast polling when it is done sending messages to the fast polling device

Looks like we do need to send stop fast polling command. I wonder if it would have fixed the ikea battery draining too?

Adminiuga commented 3 years ago

on the other side, the timeout in the "check-in" response should take precedence image

i.e. it should do fast polling until it gets the fast poll stop command or this timeout lapses.

mguaylam commented 3 years ago

That's interesting! Its weird that they chose to take precedence on the timeout where the STOP might actually be more efficient to my eyes : very long fast polls can be shortened to save battery life when it is no longer needed.

And yeah, look like it might be what IKEA needs. At least, I can confirm it works for Yale where a Fast Poll Stop exit the window and wait until the next check-in.

Do you know why they say it is mandatory where the timeout should do it alone? image

MattWestb commented 3 years ago

This is interesting !!

First i like the mechanism that is making it possible safely sending data to on SED that is not doing intense pulling. If the coordinator like sending something to it its doing on check in every 55 min and getting the command little late but OK for most SED that dont need being online all the time.

Its looks little like the fix de(F)CONZ was doing with smartthings that was locked in fast pull.

The reason Zigbee is having both time out and stop fast pull is if the coordinator is going offline or the command is not going thru to the client so its stopping fast pull itself and not burning batteries.

I have seen both method being used in sniff but i cant remember where the second variant (deCONZ / Tuya ZBGW).

I have not checking but is ZHA sending one response = false if the coordinator is not having any "news" for the SED or is doing one true and then sending on set long pull interval without fast pull stop ? I think the its doing on complete check in also if not have any thing to sending the the device. If sending false its saving some extra command sent from the device that is helping battier life in the end.

For the moment IKEA controllers is doing some extra not needed pull after one check in is finished but its no big deal then implanted the "11660" long pull.

For my is the right way ending one check in with fast pull stop command and not setting long pull interval that have making the "IKEA fix" not being needed then the long pull interval was not rewritten in the check in sequence and only being sett then paring or reconfigure the device or by user sending cluster commands to the device. The problem is that it can being devices out there that need the set long pull interval in the end that wi is not knowing for the moment.

MattWestb commented 3 years ago

By the way @Adminiuga I have recommended the Sonoff hard/software devs sending some reference devices to Zigpy devs then releasing new devices for getting more feedback from real life and easier finding hard and software problems (they is having).

Adminiuga commented 3 years ago

find this file in your install https://github.com/home-assistant/core/blob/686c92097fb1ab44c6d9d377c54d70d987412adf/homeassistant/components/zha/core/channels/general.py#L423-L424 and make it look like

        if self._ch_pool.manufacturer_code not in self._IGNORED_MANUFACTURER_ID:
            await self.set_long_poll_interval(self.LONG_POLL)
        await self.fast_poll_stop()

essentially add await self.fast_poll_stop() as the last line of the check_in_response() method

mguaylam commented 3 years ago

Look like you are the guy @Adminiuga ! Initial tests are looking good. I will test it a little more during the day to see how it goes. Will also try the IKEA end devices without the poll exclusion to see if this fixes it. image

Many thanks again!

MattWestb commented 3 years ago

THAT GREAT !!!

I think IKEA devices is also liking the "Adminiuga patch 2" with out setting long pull and instead sending fast pull stop.

The impotent thing is that the device must have getting the long pull configurated or it can being that is hanging in short / fast pull.

MattWestb commented 3 years ago

One new sniff made with one EFR32MG21 (from one LIDL / tuya LED Stripe controller) with EZSP 6.9.2.0 as coordinator and one MG1P with EZSP 6.7.8.0 as sniffer. Its only on IKEA remote and its working well with the updated with not setting long pull time and and only doing the time out being stetted to 8 in the checkin response: IKEA_REMOTE03 All IEEE 801.15.4 Ack is nicely being done and the remote is going in long pull itself after the last received command from the coordinator. I have filtering out all braodcast than its not interesting in this scenario.

beren12 commented 3 years ago

Would this affect the ikea remotes at all? Even with latest release I'm getting 5 days battery life.

Adminiuga commented 3 years ago

Try it and report back. I couldn't reproduce it in my environment

mguaylam commented 3 years ago

Sorry for the delay.

Here’s are the results : The Yale lock is doing even better than before when it comes to battery life. I’ve basically done 11 days and it did not drop of a percent yet. As for IKEA remotes with the fast poll control stop and the removal of the poll control ignore list, it does not fix IKEA remotes. For me it still dies after a couple of days. I don’t care as much for the IKEA remotes, i’m giving up on them anyway..

Many thanks again @Adminiuga !

Adminiuga commented 3 years ago

Can you please confirm, after new beta was installed, the Ikea remote was factory reset and rejoined?

beren12 commented 3 years ago

I applied the patch here in hopes it will help my IKEA remotes and it died after three days this time. I repaired, reconfigured and no real difference.

MattWestb commented 3 years ago

@mguaylam You is one of a kind "advance user" that have fast draining of IKEA remotes (48 hours on one battery) and have the knowledge of sniffing zigbee traffic. Have you seen if they is doing some strange like leaving and rejoining or being locked in fast or short pull or sending beacons all the time or OTA requests ? I have sniffing over 100 hours and cant see any thing but i have not the extreme battery draining but i still can being hardware failure in the remotes and also rigged bugs in the firmware.

If i was having your setup i wold trying doing on megasniff with only the coordinator and one problem remote and see what its doing in 24 hours and then on new 24 hours sniffing.

Sonoff is decently having hardware problems that is making it happens but other ncp looks working better but we dont knowing way :-(((

Sorry for spamming your issue but im glad ZHA have fixing it so its working well with the lock !!