home-assistant / core


quick battery drain on Legrand wireless dimmers #89398

Closed stevenwfoley closed 1 year ago

stevenwfoley commented 1 year ago

The problem

The batteries in my Legrand Wireless dimmers are lasting less than a week. Legrand says the batteries will last 7-10 years, but only when using their zigbee coordinator/gateway device.

What version of Home Assistant Core has the issue?

2023.3.1

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

zha

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

I've read about all the IKEA (and other manufacturer) battery drain issues, but none of those resolutions have helped with these devices. I tried adding the Legrand (4129) manufacturer to _IGNORED_MANUFACTURER_ID. I also tried removing the Polling cluster entity from the ZHA quirk. I've taken some Wireshark captures of the switch joined to a Legrand/Netatmo coordinator and joined to my HA/ZHA coordinator. I have noticed some discrepancies, but I have no idea how to make any changes to debug them. With the Legrand coordinator there is only one set of "End Device Timeout Request" and "End Device Timeout Response" packets, but with ZHA they come every few minutes. The differences I notice are in the "End Device Timeout Response, Success" packets.

The Legrand coordinator packet has this data:

The ZHA coordinator packet has this data:

Legrand coordinator capture: image

ZHA coordinator captures: image

Thanks for any help!

home-assistant[bot] commented 1 year ago

Hey there @dmulcahey, @adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!


puddly commented 1 year ago

Thank you for the detailed debugging! Would it be possible for you to ZIP up and upload the packet captures here (with the network key) so that I can take a look at the raw traffic as well?

The end device polling its parent too frequently is likely the cause.

stevenwfoley commented 1 year ago

Attached. My Zigbee network has over 100 devices, so there's a lot of data. I hope that my applied filter is not too limiting. It shows all packets with my wireless dimmer device as source or destination. Everything else just looked like irrelevant chatter.

legrand coordinator.zip zha coordinator.zip

MattWestb commented 1 year ago

When a real Zigbee 3.0 end device is paired, it requests an end device timeout from its parent after it has received the network key. If the parent does not support this (like TI coordinators), it normally does not reply, and the device falls back to the Zigbee HA 1.2 way of polling its parent, which leads to bad battery performance or to the device leaving the network (TI in Z2M). What device is the parent 0x09554 in ZHA, and 0x04dcf on the native hub? Both are routing devices, but neither is the coordinator, which should have address 0x0000.

Disabling the Poll Control cluster does not help; it only makes the battery drain worse, because the device then has to poll its parent more often.

I will look through the sniffs a bit and see if I can spot anything strange.

PS: Z2M has big problems with these switches leaving the network with TI coordinators, because the coordinator does not respond to the request, but they work OK with other routers as parent.

Edit: A complete pairing with both gateways would be interesting, so that everything being set up is visible in the sniffs and not just some of the commands.

stevenwfoley commented 1 year ago

My ZHA coordinator is a Sonoff Dongle-E, so SI, not TI. The parent 0x09554 in ZHA is a Zigbee 3.0 light (Zemismart) that is controlled by the Legrand dimmer, so it never loses mains power. Sorry, I do not know if 0x04dcf is the native Legrand hub.

Attached is a full capture of the join on both coordinators. Please let me know if there's anything else I can test or try.

capture.zip

MattWestb commented 1 year ago

ZHA looks OK, but the device is acting a little strange. Interesting frames:

End Device Timeout Request, 8 min:

ZigBee Network Layer Command, Dst: 0xdda4, Src: 0x3f83
    Frame Control Field: 0x1a09, Frame Type: Command, Discover Route: Suppress, Security, Destination, Extended Source Command
    Destination: 0xdda4
    Source: 0x3f83
    Radius: 1
    Sequence Number: 133
    Destination: Jennic_00:02:5c:9d:59 (00:15:8d:00:02:5c:9d:59)
    Extended Source: IEEERegi_ff:fe:61:7a:ba (64:62:66:ff:fe:61:7a:ba)
    ZigBee Security Header
    Command Frame: End Device Timeout Request
        Command Identifier: End Device Timeout Request (0x0b)
        Requested Timeout Enumeration: 8 min (3)
        End Device Configuration: 0x00

Response:

ZigBee Network Layer Command, Dst: 0x3f83, Src: 0xdda4
    Frame Control Field: 0x1a09, Frame Type: Command, Discover Route: Suppress, Security, Destination, Extended Source Command
    Destination: 0x3f83
    Source: 0xdda4
    Radius: 1
    Sequence Number: 198
    Destination: IEEERegi_ff:fe:61:7a:ba (64:62:66:ff:fe:61:7a:ba)
    Extended Source: Jennic_00:02:5c:9d:59 (00:15:8d:00:02:5c:9d:59)
    ZigBee Security Header
    Command Frame: End Device Timeout Response, Success
        Command Identifier: End Device Timeout Response (0x0c)
        Status: Success (0)
        Parent Information: 0x07, MAC Data Poll Keepalive, End Device Timeout Request Keepalive, Power Negotiation Supported
            .... ...1 = MAC Data Poll Keepalive: True
            .... ..1. = End Device Timeout Request Keepalive: True
            .... .1.. = Power Negotiation Supported: True
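
The Parent Information byte in the response above is a small bitmask. As a minimal sketch, assuming only the three bit positions that Wireshark prints in this dissection (the class and function names are made up for illustration), it can be decoded like this:

```python
from enum import IntFlag


class ParentInfo(IntFlag):
    """Bit names copied from the Wireshark dissection above."""
    MAC_DATA_POLL_KEEPALIVE = 0x01
    END_DEVICE_TIMEOUT_REQUEST_KEEPALIVE = 0x02
    POWER_NEGOTIATION_SUPPORTED = 0x04


def decode_parent_info(value: int) -> list[str]:
    """Return the names of the bits set in a Parent Information byte."""
    return [flag.name for flag in ParentInfo if value & flag]


# 0x07 is the value the parent 0xdda4 sent in the capture.
print(decode_parent_info(0x07))
```

With 0x07 all three names are returned, which is what "supports everything" means here.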

The parent supports all possible features. The switch's node descriptor:

ZigBee Application Support Layer Data, Dst Endpt: 0, Src Endpt: 0
    Frame Control Field: Data (0x40)
    Destination Endpoint: 0
    Node Descriptor Response (Cluster ID: 0x8002)
    Profile: ZigBee Device Profile (0x0000)
    Source Endpoint: 0
    Counter: 4
ZigBee Device Profile, Node Descriptor Response, Rev: 22, Nwk Addr: 0x3f83, Status: Success
    Sequence Number: 31
    Status: Success (0)
    Nwk Addr of Interest: 0x3f83
    Node Descriptor
        .... .... .... .010 = Type: 2 (End Device)
        .... .... .... 0... = Complex Descriptor: False
        .... .... ...1 .... = User Descriptor: True
        .... 0... .... .... = 868MHz BPSK Band: False
        ..0. .... .... .... = 902MHz BPSK Band: False
        .1.. .... .... .... = 2.4GHz OQPSK Band: True
        0... .... .... .... = EU Sub-GHz FSK Band: False
        Capability Information: 0x80
            .... ...0 = Alternate Coordinator: False
            .... ..0. = Full-Function Device: False
            .... .0.. = AC Power: False
            .... 0... = Rx On When Idle: False
            .0.. .... = Security Capability: False
            1... .... = Allocate Short Address: True
        Manufacturer Code: 0x1021
        Max Buffer Size: 89
        Max Incoming Transfer Size: 63
        Server Flags: 0x2c00
            .... .... .... ...0 = Primary Trust Center: False
            .... .... .... ..0. = Backup Trust Center: False
            .... .... .... .0.. = Primary Binding Table Cache: False
            .... .... .... 0... = Backup Binding Table Cache: False
            .... .... ...0 .... = Primary Discovery Cache: False
            .... .... ..0. .... = Backup Discovery Cache: False
            .... .... .0.. .... = Network Manager: False
            0010 110. .... .... = Stack Compliance Revision: 22
        Max Outgoing Transfer Size: 63
        Descriptor Capability Field: 0x00
            .... ...0 = Extended Active Endpoint List Available: False
            .... ..0. = Extended Simple Descriptor List Available: False

It is an end device with its radio off when idle, so a real sleeper. ZHA configures a long poll interval of 6 seconds:

ZigBee Application Support Layer Data, Dst Endpt: 1, Src Endpt: 1
    Frame Control Field: Data (0x00)
    Destination Endpoint: 1
    Cluster: Poll Control (0x0020)
    Profile: Home Automation (0x0104)
    Source Endpoint: 1
    Counter: 210
ZigBee Cluster Library Frame
    Frame Control Field: Cluster-specific (0x01)
    Sequence Number: 85
    Command: Set Long Poll Interval (0x02)
    New Long Poll Interval: 24

24 * 1/4 sec = 6 sec.

ZHA ends fast polling after the check-in with the Fast Poll Stop command, so the device should return to the long poll interval:

ZigBee Application Support Layer Data, Dst Endpt: 1, Src Endpt: 1
    Frame Control Field: Data (0x00)
    Destination Endpoint: 1
    Cluster: Poll Control (0x0020)
    Profile: Home Automation (0x0104)
    Source Endpoint: 1
    Counter: 212
ZigBee Cluster Library Frame
    Frame Control Field: Cluster-specific (0x01)
    Sequence Number: 89
    Command: Fast Poll Stop (0x01)
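
The Set Long Poll Interval value above is expressed in quarter seconds. A small sketch of that arithmetic, with illustrative function names, shows what the device should be doing once fast polling has stopped:

```python
QUARTER_SECONDS_PER_SECOND = 4


def long_poll_seconds(raw_interval: int) -> float:
    """Convert a Poll Control long poll interval (quarter seconds) to seconds."""
    return raw_interval / QUARTER_SECONDS_PER_SECOND


def expected_polls_per_hour(raw_interval: int) -> float:
    """Number of data requests per hour if the device honours the interval."""
    return 3600 / long_poll_seconds(raw_interval)


print(long_poll_seconds(24))
print(expected_polls_per_hour(24))
```

For the value 24 sent by ZHA this gives one poll every 6 seconds, or 600 polls per hour.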

And yet the device keeps polling its parent like this??

No. Time    PAN Protocol    IEEE Src    IEEE Dst    Zigbee Src  Zigbee Dst  ZBN Dst Group Nr    ZBN Seq ZBA Seq ZDP Seq Nwk Seq Src EP  Dst EP  Info
2894    15:12:58,999096 0x74f1  IEEE 802.15.4   0x3f83  0xdda4  0x3f83  0xdda4                      94          Data Request
2895    15:12:59,021161 0x74f1  IEEE 802.15.4   0xdda4  0x3f83  N/A N/A                     94          Ack
2943    15:13:00,082205 0x74f1  IEEE 802.15.4   0x3f83  0xdda4  0x3f83  0xdda4                      95          Data Request
2944    15:13:00,082532 0x74f1  IEEE 802.15.4   0xdda4  0x3f83  N/A N/A                     95          Ack
2992    15:13:00,559708 0x74f1  IEEE 802.15.4   0x3f83  0xdda4  0x3f83  0xdda4                      96          Data Request
2993    15:13:00,560481 0x74f1  IEEE 802.15.4   0xdda4  0x3f83  N/A N/A                     96          Ack
3007    15:13:01,096412 0x74f1  IEEE 802.15.4   0x3f83  0xdda4  0x3f83  0xdda4                      97          Data Request
3008    15:13:01,097597 0x74f1  IEEE 802.15.4   0xdda4  0x3f83  N/A N/A                     97          Ack
3018    15:13:01,638609 0x74f1  IEEE 802.15.4   0x3f83  0xdda4  0x3f83  0xdda4                      98          Data Request
3019    15:13:01,639430 0x74f1  IEEE 802.15.4   0xdda4  0x3f83  N/A N/A                     98          Ack
3026    15:13:02,184887 0x74f1  IEEE 802.15.4   0x3f83  0xdda4  0x3f83  0xdda4                      99          Data Request
3027    15:13:02,185962 0x74f1  IEEE 802.15.4   0xdda4  0x3f83  N/A N/A                     99          Ack
3037    15:13:02,741087 0x74f1  IEEE 802.15.4   0x3f83  0xdda4  0x3f83  0xdda4                      100         Data Request
3038    15:13:02,742125 0x74f1  IEEE 802.15.4   0xdda4  0x3f83  N/A N/A                     100         Ack

Two times per second!!! I think you would need a battery of several Ah to keep the device from dying within a few days ;-))
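
The poll rate can be checked directly from the Data Request timestamps in the rows above. This is only a sketch: the timestamps are copied by hand from the capture excerpt, and the helper name is invented for illustration.

```python
from datetime import datetime
from statistics import median

# Data Request timestamps from the capture rows above (Wireshark prints a decimal comma).
DATA_REQUEST_TIMES = [
    "15:12:58,999096",
    "15:13:00,082205",
    "15:13:00,559708",
    "15:13:01,096412",
    "15:13:01,638609",
    "15:13:02,184887",
    "15:13:02,741087",
]


def poll_intervals(stamps: list[str]) -> list[float]:
    """Seconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S,%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


intervals = poll_intervals(DATA_REQUEST_TIMES)
typical = median(intervals)
print(f"median interval: {typical:.3f} s, about {1 / typical:.1f} polls per second")
```

The median gap comes out a little over half a second, which is far from the 6 second long poll interval that was configured.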

Can you do a new sniff while pairing with the Legrand GW, but a little longer, so I can see whether it does the polling with or without the check-in and Poll Control functions being configured?

PS: In Wireshark you can use a filter to see only the frames to and from one device, like this: (wpan.src16 == 0x3f83) || (wpan.dst16 == 0x3f83). It works on the 802.15.4 layer, so you also see the data requests and acks in 15.4.

Edit: It is 2 requests per second, and each one also gets a reply / ack from the parent.
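
Since the short address changes with every join (0x3f83 here, 0x3e51 in a later capture), the display filter from the PS above has to be retyped each time. A tiny helper, hypothetical and only for convenience, can build it from the address:

```python
def device_filter(short_addr: int) -> str:
    """Build the Wireshark display filter for all 802.15.4 frames to or from one device."""
    addr = f"0x{short_addr:04x}"
    return f"(wpan.src16 == {addr}) || (wpan.dst16 == {addr})"


print(device_filter(0x3F83))
```

The resulting string can be pasted into the Wireshark filter bar.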

stevenwfoley commented 1 year ago

Thanks for looking at the captures! And yeah, some massive batteries would solve the problem for sure :) Attached is another capture of a join on the legrand coordinator, but for a much longer span of time.

long capture.zip

MattWestb commented 1 year ago

The switch reads some very custom attributes on the coordinator, and the coordinator does the same on the switch, but it does not set up any strange things; it only binds the relevant clusters. The coordinator has a slightly unusual working mode: its short address is 0x4dcf, yet when the device sends a match descriptor request for OTA (asking whether anyone has the OTA server function), the answer comes from 0x0000. The switch then asks for the long address of 0x0000, and the coordinator sends an IEEE address that is not its own, although it evidently listens on it, since it receives the commands sent to 0x0000. I think it is a combined ZLL working mode with distributed security and a coordinator.

When all of that is done, it polls its parent 2 times per second for a long time, even after it has received a Fast Poll Stop command. Then it is silent for 35 minutes, which is not OK: it should poll its parent at least every 8 minutes, as it stated in the End Device Timeout Request when joining (best practice is less than half that time, in case one packet is missed). Then it rejoins the network, gets updated, and looks like it is working "normal".

No. Time    PAN Protocol    IEEE Src    IEEE Dst    Zigbee Src  Zigbee Dst  ZBN Dst Group Nr    ZBN Seq ZBA Seq ZDP Seq Nwk Seq Src EP  Dst EP  Info
864 23:04:06,575987 0xf6d8  IEEE 802.15.4   0x3e51  0x4dcf  0x3e51  0x4dcf                      147         Data Request
865 23:04:06,576876 0xf6d8  IEEE 802.15.4   0x4dcf  0x3e51  N/A N/A                     147         Ack
866 23:04:07,118681 0xf6d8  IEEE 802.15.4   0x3e51  0x4dcf  0x3e51  0x4dcf                      148         Data Request
867 23:04:07,121547 0xf6d8  IEEE 802.15.4   0x4dcf  0x3e51  N/A N/A                     148         Ack
868 23:04:08,202838 0xf6d8  IEEE 802.15.4   0x3e51  0x4dcf  0x3e51  0x4dcf                      149         Data Request
869 23:04:08,204130 0xf6d8  IEEE 802.15.4   0x4dcf  0x3e51  N/A N/A                     149         Ack
870 23:04:08,761875 0xf6d8  IEEE 802.15.4   0x3e51  0x4dcf  0x3e51  0x4dcf                      150         Data Request
871 23:04:08,763097 0xf6d8  IEEE 802.15.4   0x4dcf  0x3e51  N/A N/A                     150         Ack
873 23:04:09,310448 0xf6d8  IEEE 802.15.4   0x3e51  0x4dcf  0x3e51  0x4dcf                      151         Data Request
874 23:04:09,312088 0xf6d8  IEEE 802.15.4   0x4dcf  0x3e51  N/A N/A                     151         Ack
2354    23:40:02,144922 0xf6d8  ZigBee  0x3e51  0x4dcf  0x3e51  0x4dcf  0x4dcf      227         66          Rejoin Request, Device: 0x3e51
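
The silence before the Rejoin Request can be compared with the timeout the switch asked for. In the sketch below, only the pair "3 means 8 min" is confirmed by the capture; the rest of the enumeration mapping (0 is 10 seconds, n from 1 to 14 is 2^n minutes) is my reading of the Zigbee specification and should be treated as an assumption. The function names are illustrative.

```python
from datetime import datetime


def timeout_seconds(enum_value: int) -> int:
    """Requested Timeout Enumeration to seconds (assumed: 0 = 10 s, n = 2**n minutes)."""
    if not 0 <= enum_value <= 14:
        raise ValueError("enumeration must be in the range 0..14")
    return 10 if enum_value == 0 else (2 ** enum_value) * 60


def gap_seconds(start: str, end: str) -> float:
    """Seconds between two Wireshark time-of-day stamps on the same day."""
    fmt = "%H:%M:%S,%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


requested = timeout_seconds(3)  # "8 min (3)" in the End Device Timeout Request
# Last Data Request and the Rejoin Request from the rows above.
silent = gap_seconds("23:04:09,310448", "23:40:02,144922")

print(f"requested timeout: {requested} s, poll at least every {requested // 2} s to be safe")
print(f"observed silence: {silent / 60:.1f} min, exceeds timeout: {silent > requested}")
```

The gap is several times longer than the requested timeout, so a parent that enforces the timeout could drop the child before it comes back, which would fit the rejoin seen in the capture.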

You don't need to post the sniff, but I am interested to know how the switch behaves over a longer time with the Legrand GW. Does it poll its parent 2 times per second all the time, or does it take pauses? If so, how long are the pauses? And does it rejoin the network repeatedly over a long period? (Normally it should never rejoin unless the battery runs out and is replaced, and I think it should certainly not do so within 24 hours.)

If it is not stable with the Legrand GW, we probably can't get it working OK in ZHA. If it is stable, we must look more closely at the attributes it reads from the GW and the GW reads from the switch, but that is pointless if it is not stable with the Legrand GW.

I think the firmware has serious bugs in the Poll Control part, since it does not respect the configured long poll interval after Fast Poll Stop or after the check-in times out, but I need to see how it behaves in the long run.

stevenwfoley commented 1 year ago

I have a dimmer that's been on the Legrand GW for over a year; it works perfectly, and it's still at 100% battery. It appears that the dimmer goes into deep sleep and never communicates with the GW unless you push a button on the dimmer. I ran a capture for over 24 hours and I did not see a single packet until I pushed a button on the dimmer. I would expect this from an SED though, especially one that never needs to receive data from the GW. The relationship between the GW and an SED is very one-directional.

MattWestb commented 1 year ago

I agree with you about a "real" SED, but it depends on how it is configured. If the "24 hour switch" has the same config as the sniffed one, it is not behaving correctly. The switch requested an end device timeout of 8 min, so it should send a 15.4 data request to its parent at least every 8 min, and preferably in under 4 min in case one frame is missed; otherwise its parent flags it as offline and deletes it from its child table, and the network can't find it = offline. (To sniff the 15.4 polls and acks you must have the sniffer near the device, because they are not relayed by other routers the way commands between the coordinator and the device are.)

I am testing an old firmware for the IKEA OnOff switch. It is ZLL, so it does not do check-ins, because that was not implemented in the first ZLL standard, but it does poll its parent so that it is not flagged offline. With the latest ZB3 firmware it does a check-in to the coordinator about every 55 min and nothing else, because it does not need to; if the system needs to send something to it, the message is put in the queue and sent to the device when it does its next check-in to the coordinator. ZHA has a record in the log every 50 min with the check-in from the device, which is very nice for seeing that it is working OK.

Back to the long sniff of the Legrand GW: we do not know how the "24 hour switch" was configured, so it is not easy to make ZHA apply the same config so that it sleeps well (it may have received a different config, since the firmware in the device and the GW may have been different when it was paired). That is why it was interesting to see whether the "fast polling switch" would go into deep sleep after the configuration we saw in the sniff (it did not).

So we still need to sniff a switch joining the Legrand GW and being set up, see that the switch goes into deep sleep, and then try to get ZHA to configure it the same way, so that it goes into deep sleep instead of polling its parent 2 times per second and rejoining the network all the time.

Also, I know how ZB3 should set up a device the right way, but older Zigbee HA 1.x and ZLL 1.x are a little trickier, because they were not as strict as ZB3, which is also better documented, and one can see how other systems do it by sniffing them.

How shall we go forward? I'm curious how the Legrand GW sets up the switch to get it into deep sleep and working OK, so that we can get it working OK with ZHA.

stevenwfoley commented 1 year ago

Thank you so much for your continued help! Sorry for the delayed response, we are moving to a new house. According to legrand, all of their modern devices are running ZB3 firmware. I'm not sure where to go from here either, but I'm willing to try anything I can. Do you have any suggestions for what I can try to narrow down the specifics of the issue?

MattWestb commented 1 year ago

One universal way on EZSP to fix battery-draining controllers is to block the coordinator from having end devices as children. The problem normally arises when an end device tries to communicate while the coordinator is offline or the host system has problems: the device then jumps to a new router if it can (Xiaomi devices can't; they leave the network). Why this helps: IKEA devices had, and it looks like many others such as Hue have, bugs that were triggered by the end device jumping parents and that killed the battery.

I was thinking you could try the same when you have time and have things in place in your new home; it may help with end device problems in your system too.

More sniffing to try to catch what is going wrong would also help, but there is no hurry from my side; you need your time for more important things at the moment.

stevenwfoley commented 1 year ago

I know I've read about and seen the config entries necessary for limiting the children on the coordinator, but I'm not able to find it again. Can you please remind me of the config entries?

MattWestb commented 1 year ago

Part of my config:

zha:
  custom_quirks_path: /config/custom_zha_quirks/
#    handle_unknown_devices: yes
  zigpy_config:
    network:
#      key: [11,11,14,14,13,13,12,11,10,10,11,12,13,14,11,11]   ## 16 bytes of network key
      channel: 15
      channels: [11, 15, 20, 25]
      pan_id: 0xA11A
#      extended_pan_id: "DD:DD:DD:DD:DD:DD:DD:DD"
    source_routing: true
    ezsp_config:
      CONFIG_MAX_END_DEVICE_CHILDREN: 0

Slightly scrambled so as not to make my neighbors too happy. If you are setting up a new network, you must keep CONFIG_MAX_END_DEVICE_CHILDREN non-zero until you have added the first router and ZHA has accepted it; after that you can set it to zero or whatever you like. The setting only takes effect when restarting (Z)HA.

stevenwfoley commented 1 year ago

Thanks. I added this entry and restarted HA. Do I need to re-pair the dimmers to ensure they are not connecting to the coordinator, or will the coordinator reject them and force them to find a router?

MattWestb commented 1 year ago

If it had the coordinator as its parent, it should be kicked and join another router it can find. You can see how it is connected in the network map / visualization (click on update, wait a little, and go back to the map so that the network data is current).

If you like, you can re-pair it, or just use it a little and see if it works; it should then have a connection, but not directly with the coordinator.

The only devices I know of that do not jump are Aqara / Xiaomi sensors; they leave silently if the parent is offline. All good devices should jump when needed.

stevenwfoley commented 1 year ago

Well the "CONFIG_MAX_END_DEVICE_CHILDREN: 0" setting did not improve battery life.

stevenwfoley commented 1 year ago

FWIW, these devices seem to work as expected with zigbee2mqtt. I switched to z2m last week and the two dimmers I'm testing have had zero battery drain so far. I would ultimately like to stay on ZHA. Can z2m be used to try and figure out how ZHA can be altered to support these legrand wireless devices?

MattWestb commented 1 year ago

Sounds great! Z2M sets up the coordinator part a little differently, but the working mode is the same from the network's point of view. The interesting part is which parent the dimmer is using. Can you look at the network map to see how it is connected in the network?

stevenwfoley commented 1 year ago

The parent for both dimmers is the coordinator. I actually paired them first, before I had any other routers paired, so the coordinator was the only option.

MattWestb commented 1 year ago

Z2M has got the EZSP settings / parameters for this working mode very right!! Normally there are problems with them, such as TI coordinators not being able to have direct Zigbee 3 children unless the latest beta firmware is used in the coordinator.

issue-triage-workflows[bot] commented 1 year ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.