jethome-ru / zigbee-firmware

CC2538 UART coordinator loses pairing after power loss #13

Closed: tsmt09 closed this issue 3 years ago

tsmt09 commented 3 years ago

Hello everyone

First of all, I don't want this to be a "fix this" post; I would just love your advice on a problem I currently have.

We currently have the problem that our UART coordinators lose all pairings with devices when the UART controller loses power.

I myself already did the following to debug the problem and maybe solve it:

This works fine: I'm able to replicate the issue and am now investigating the ZStack code, so my build toolchain for this is fine.

We have a longer discussion going on in the Homegear forums. Here's a testing table from @codmpm:

[Screenshot from 2021-02-27 15-02-18: testing table]

This post currently suggests that the pairing loss only happens with the UART version of the firmware, while the USB version works fine. Unfortunately, it's in German.

I already diffed the USB and UART patches but found no clue as to where the UART version differs in a way that could lead to this behaviour.

I'm pretty new to the whole codebase. Do you have any idea why this is happening, and could you maybe point me to the part of the ZStack code that could be causing it?

If the Homegear forum doesn't give you public access, I'm happy to copy the posts here; just ping me.

michalszymura commented 3 years ago

The same thing happens on mine, flashed with the USB firmware. I'm starting to think my CC2538 might be faulty.

TroLoos commented 3 years ago

Yeah, I had a power loss during the night. And guess what: all of my devices are disconnected.

Looking at some other threads, I found that running the "zStackEraseAllNvMem.js" script is recommended, because this issue may be connected to an NVRAM problem.

I've run the script, and it erased what it was supposed to erase.

Now the z2m software doesn't start. I suppose I will have to re-flash my UART CC2538. That just doesn't seem like a fix to me :-(.
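The thread never shows the internals of zStackEraseAllNvMem.js, but the log output codmpm posts further down (`Clearing all NVMEM items, from 0 to 831`, one `NVMEM item #N - deleting, size: S` line per item, and an abort when a delete returns `NV_OPER_FAILED` instead of `SUCCESS` or `NV_ITEM_INITIALIZED`) suggests roughly what one erase pass does. The Python below is a simplified, hypothetical model of such a pass, run against an in-memory fake; `FakeNv` and `erase_all` are illustrative names, not zigbee-herdsman APIs, and skipping empty items is an assumption.

```python
from typing import Dict

# Status bytes as they appear in the erase script's error message in this thread.
SUCCESS = 0x00
NV_ITEM_INITIALIZED = 0x09
NV_OPER_FAILED = 0x0A
ACCEPTED = {SUCCESS, NV_ITEM_INITIALIZED}


class FakeNv:
    """Hypothetical in-memory stand-in for the stick's NV item storage."""

    def __init__(self, items: Dict[int, int], fail_on: int = -1) -> None:
        self.items = dict(items)  # item id -> size in bytes
        self.fail_on = fail_on    # item id whose delete is forced to fail

    def length(self, item_id: int) -> int:
        return self.items.get(item_id, 0)

    def delete(self, item_id: int, size: int) -> int:
        if item_id == self.fail_on:
            return NV_OPER_FAILED
        self.items.pop(item_id, None)
        return SUCCESS


def erase_all(nv: FakeNv, first: int = 0, last: int = 831) -> int:
    """Delete every non-empty item with an id in [first, last]; return the count."""
    deleted = 0
    for item_id in range(first, last + 1):
        size = nv.length(item_id)
        if size == 0:
            continue  # assumption: empty or missing items are skipped
        print(f"NVMEM item #{item_id} - deleting, size: {size}")
        status = nv.delete(item_id, size)
        if status not in ACCEPTED:
            # Abort the pass; codmpm's run in this thread likewise stopped at item #60.
            raise RuntimeError(f"delete of item {item_id} failed with status 0x{status:02x}")
        deleted += 1
    return deleted
```

In this model a single failed delete aborts the pass and leaves the remaining items in place, which is consistent with the log, where the run ended at item #60.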

codmpm commented 3 years ago

I can confirm that on power-loss the USB as well as the UART version suffer from the same problem.

I've hacked together a USB version with our CC2538 UART module, and as soon as the power from the Pi goes down or I simply unplug the USB, I have to re-pair all devices like wall sockets and so on. A paired TRÅDFRI two-way switch works every time, even after a power loss.

[Photo: IMG_8713 copy]

Let me know what else I can test.

tsmt09 commented 3 years ago

Hey guys, thanks for your contributions here.

That's great news, or maybe not, since now we need to find out why this is happening. But since it happens with both the USB and the UART version, I no longer need to look for differences between them; that's one path of investigation I can skip. I will continue researching; maybe I'll find something in the CC2538 docs.

michalszymura commented 3 years ago

I will solder up another one tomorrow and do some testing. I was thinking it could have something to do with Docker. Do you run it in Docker or not? EDIT: I mean zigbee2mqtt or whatever you are using.

codmpm commented 3 years ago

Do You run it in docker or not?

No, only running on plain Raspbian.

tsmt09 commented 3 years ago

Same.

michalszymura commented 3 years ago

Then it's the same on Docker and non-Docker setups. I was hoping it was Docker messing with things.

0x3EC commented 3 years ago

@tsmt09 Is this problem similar to the one here? https://github.com/Koenkk/zigbee2mqtt/issues/6409 Could you share Zigbee sniffer logs and the coordinator_backup.json file?

michalszymura commented 3 years ago

This is my coordinator_backup.json: https://pastebin.com/qFjLwzc8
I'll post sniffer logs soon; I just need to find my CC2531.

0x3EC commented 3 years ago

@michalszymura Your NIB table is 110 bytes in size. For CC2538, the table must be 116 bytes.

The surest way to fix the problem is to delete the old backups and create a new network on the CC2538 stick.

If you find a sniffer and you have the time / opportunity, we can try to repair your current network.
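0x3EC's diagnosis comes down to one number: the `len` recorded for `ZCD_NV_NIB` in coordinator_backup.json (110 is the bad case, 116 is what a CC2538 should have). A minimal Python sketch of that check follows. It assumes only what the thread quotes, namely an entry keyed `ZCD_NV_NIB` with a `len` field; because the nesting of that entry may differ between zigbee2mqtt versions, the key is searched for recursively. `find_nib_len` and `check_backup` are hypothetical helper names.

```python
import json
from typing import Any, Optional

# NIB length given in this thread for a correctly initialised CC2538 backup.
EXPECTED_CC2538_NIB_LEN = 116


def find_nib_len(node: Any) -> Optional[int]:
    """Return the "len" of the first "ZCD_NV_NIB" entry found, or None."""
    if isinstance(node, dict):
        entry = node.get("ZCD_NV_NIB")
        if isinstance(entry, dict) and "len" in entry:
            return int(entry["len"])
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    # Assumption: nesting varies by version, so search depth-first.
    for child in children:
        found = find_nib_len(child)
        if found is not None:
            return found
    return None


def check_backup(path: str) -> bool:
    """Print the NIB length found in the backup and report whether it is 116."""
    with open(path, encoding="utf-8") as fh:
        nib_len = find_nib_len(json.load(fh))
    if nib_len is None:
        print("ZCD_NV_NIB not found in backup")
        return False
    print(f"ZCD_NV_NIB len = {nib_len} (expected {EXPECTED_CC2538_NIB_LEN})")
    return nib_len == EXPECTED_CC2538_NIB_LEN
```

For example, `check_backup("coordinator_backup.json")` would return `False` for a backup recorded with `"len": 110`.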

hacker-cb commented 3 years ago

As far as I remember, this was a bug in old z2m backups, so it would be better to at least try the latest release before reporting a problem.

michalszymura commented 3 years ago

@0x3EC My network is currently set up from scratch, so that is kind of weird. I did reflash my CC2538 and set up a new PAN ID / extended PAN ID. I deleted all backups and then paired everything. I just found a CC2531; I'll flash it and do the sniffing. Also, I'm using the latest dev version of zigbee2mqtt.

0x3EC commented 3 years ago

@michalszymura Do this:

  1. Stop z2m
  2. Delete z2m backups
  3. Clear the NV memory with the script https://github.com/Koenkk/zigbee2mqtt/blob/master/scripts/zStackEraseAllNvMem.js
  4. Cold-restart the CC2538 stick
  5. Start z2m
  6. Set up a new network, add devices.
  7. Stop z2m and check that the NIB table is 116 bytes. If so, then your network will work stably.

Use the 20201010 firmware for the CC2538.
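To confirm which build is actually on the stick, the erase script's first output line is enough: codmpm's run later in the thread prints a `Detected znp version 'zStack30x' (...)` line whose JSON includes `"revision":20201010`. The Python sketch below parses such a line and compares the `revision` field with 20201010. It assumes the line format stays exactly as quoted in this thread; `parse_znp_version` and `is_recommended_firmware` are hypothetical names.

```python
import json
import re
from typing import Optional, Tuple

# Firmware revision recommended in this thread for the CC2538.
RECOMMENDED_REVISION = 20201010

# Assumed format, as printed in this thread:
# Detected znp version 'zStack30x' ({"transportrev":2,...,"revision":20201010})
_VERSION_RE = re.compile(r"Detected znp version '([^']+)' \((\{.*\})\)")


def parse_znp_version(line: str) -> Optional[Tuple[str, dict]]:
    """Return (stack name, version fields), or None if the line does not match."""
    match = _VERSION_RE.search(line)
    if match is None:
        return None
    return match.group(1), json.loads(match.group(2))


def is_recommended_firmware(line: str) -> bool:
    """True if the reported revision equals the recommended one."""
    parsed = parse_znp_version(line)
    return parsed is not None and parsed[1].get("revision") == RECOMMENDED_REVISION
```

Checking the reported revision this way rules out a stale flash before spending time on the NIB reset.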

michalszymura commented 3 years ago

@0x3EC I will try this in the morning, as I need working lights right now ;) I did a capture, but it's probably not useful at this point.

[Attachment: zigbee_capture.zip]

michalszymura commented 3 years ago

The NIB table in the backup is now 116 bytes; I will report if anything changes.

0x3EC commented 3 years ago

@michalszymura Is there still a problem?

tsmt09 commented 3 years ago

Good news from my side, and also a BIIIIG thank you! This seems to have solved our problem. I did what you requested above, and after the restart my coordinator backup NIB had a length of 116. Before, it was definitely 110.

Then I accidentally unplugged the power to my Raspi (CC2538 on a Raspi HAT), and after rebooting, my temperature sensor was still there. Very good! It never did that before.

I added the following devices:

All of them work after a cold restart with the power removed for a minute or more. The actuator works as well.

I will just leave it offline for a few hours and retest tomorrow to see whether the devices are still there.

@0x3EC, what do you think the problem is here? Is the CC2538 firmware initialized with wrong data? I'm asking because I use the exact same controller with the same freshly downloaded firmware and the same z2m version. I just want to understand the problem. I'm really happy and thankful that you helped us with this. Much appreciated.

codmpm commented 3 years ago

@0x3EC thank you so much.

Can confirm that it works.

With "len": 110 at ZCD_NV_NIB in coordinator_backup.json, I could reproduce the described issues.

After running node scripts/zStackEraseAllNvMem.js /dev/ttyAMA0, deleting coordinator_backup.json and re-pairing some devices, I have "len": 116 and the CC2538 "survives" a cold start without losing its pairings. Wohooo!!! 🤗

However, on some modules I had this problem:

Detected znp version 'zStack30x' ({"transportrev":2,"product":2,"majorrel":2,"minorrel":7,"maintrel":2,"revision":20201010})
Clearing all NVMEM items, from 0 to 831
NVMEM item #1 - deleting, size: 8
NVMEM item #2 - deleting, size: 2
[...]
NVMEM item #59 - deleting, size: 17
NVMEM item #60 - deleting, size: 1
(node:632) UnhandledPromiseRejectionWarning: Error: SREQ '--> SYS - osalNvDelete - {"id":60,"len":1}' failed with status '(0x0a: NV_OPER_FAILED)' (expected '(0x00: SUCCESS),(0x09: NV_ITEM_INITIALIZED)')
    at Znp.<anonymous> (/opt/zigbee2mqtt/node_modules/zigbee-herdsman/dist/adapter/z-stack/znp/znp.js:291:27)
    at Generator.next (<anonymous>)
    at fulfilled (/opt/zigbee2mqtt/node_modules/zigbee-herdsman/dist/adapter/z-stack/znp/znp.js:24:58)
(node:632) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:632) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

and had to restart the script several times until it finally went through. I did this both with the 3.3 V from the Pi and with 3.3 V derived from the 5 V via an HT7333 (see pic above).
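When the erase run dies, the useful facts are all in the `SREQ ... failed with status` line: which command failed (`SYS - osalNvDelete`), on which item (`{"id":60,"len":1}`), and with which status (`0x0a: NV_OPER_FAILED`). Recording those across repeated runs would show whether the failures always hit the same item. The Python sketch below extracts them with a regular expression that assumes the exact message format shown in the log above; `SreqFailure` and `parse_sreq_failure` are hypothetical names.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Assumed format, as printed in this thread:
# SREQ '--> SYS - osalNvDelete - {"id":60,"len":1}' failed with status '(0x0a: NV_OPER_FAILED)'
_SREQ_FAIL_RE = re.compile(
    r"SREQ '--> (\w+) - (\w+) - (\{[^']*\})' failed with status "
    r"'\((0x[0-9a-fA-F]+): (\w+)\)'"
)


@dataclass
class SreqFailure:
    subsystem: str    # e.g. "SYS"
    command: str      # e.g. "osalNvDelete"
    payload: dict     # e.g. {"id": 60, "len": 1}
    status: int       # numeric status byte, e.g. 0x0A
    status_name: str  # e.g. "NV_OPER_FAILED"


def parse_sreq_failure(line: str) -> Optional[SreqFailure]:
    """Extract the failed request from one log line, or None if there is none."""
    m = _SREQ_FAIL_RE.search(line)
    if m is None:
        return None
    return SreqFailure(
        subsystem=m.group(1),
        command=m.group(2),
        payload=json.loads(m.group(3)),
        status=int(m.group(4), 16),
        status_name=m.group(5),
    )
```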

Thank you so so much.

Cheers.
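codmpm had to re-run the erase script several times before it completed. A hypothetical Python wrapper for that loop follows. It treats a run as clean only if the exit code is 0 and the output contains neither `UnhandledPromiseRejection` nor `failed with status`: per the deprecation warning in the log, the Node.js version in use only warned about the unhandled rejection instead of exiting with a non-zero code, so the exit code alone may not reveal the failure. The timeout guards against a run that hangs instead of exiting. `run_until_clean`, the attempt count and the timeout are assumptions, not part of zigbee2mqtt.

```python
import subprocess
from typing import Sequence

# Text that indicates the run failed even if the exit code says otherwise.
FAILURE_MARKERS = ("UnhandledPromiseRejection", "failed with status")


def run_until_clean(cmd: Sequence[str], attempts: int = 5, timeout: float = 300.0) -> int:
    """Re-run cmd until one run is clean; return the 1-based attempt number.

    A run counts as clean only if it exits with code 0 and its combined
    stdout/stderr contains none of FAILURE_MARKERS. Raises RuntimeError if
    no attempt succeeds.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                list(cmd), capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            print(f"attempt {attempt}: timed out after {timeout} s")
            continue
        output = result.stdout + result.stderr
        if result.returncode == 0 and not any(m in output for m in FAILURE_MARKERS):
            return attempt
        print(f"attempt {attempt}: failed (exit code {result.returncode})")
    raise RuntimeError(f"command did not complete cleanly after {attempts} attempts")
```

With the command from this thread it would be called as `run_until_clean(["node", "scripts/zStackEraseAllNvMem.js", "/dev/ttyAMA0"])`, from the zigbee2mqtt directory (/opt/zigbee2mqtt in codmpm's setup) with z2m stopped.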

michalszymura commented 3 years ago

@0x3EC Seems to be behaving correctly. I was able to unplug the stick and plug it back in and didn't have any issues.

tsmt09 commented 3 years ago

I put together a longer, detailed tutorial on resetting the NIB table with zigbee2mqtt:

https://github.com/codm/cc2538-fix#english

I haven't had time yet to build something that does this directly over the serial port, but the tutorial should do for now.

I'll close the issue soon if there are no objections.