coordinator crashing? - Githubissues

chrisdrackett commented 1 year ago

I've been trying to debug this issue for a couple weeks. I started with home assistant and Z2M, but I've been slowly working my way down the line to the point that I think the issue is my CC2652P2 Based Zigbee to PoE Coordinator 2022. It seems like it is going offline at random intervals, and if its offline for long enough (or some other timing works out) Z2M will crash and not auto-restart.

The only way of debugging I currently can think of is pinging the IP of the coordinator. I see the following at times:

64 bytes from 192.168.1.102: icmp_seq=127 ttl=255 time=1.461 ms
64 bytes from 192.168.1.102: icmp_seq=128 ttl=255 time=0.869 ms
64 bytes from 192.168.1.102: icmp_seq=129 ttl=255 time=1.395 ms
64 bytes from 192.168.1.102: icmp_seq=130 ttl=255 time=0.968 ms
64 bytes from 192.168.1.102: icmp_seq=131 ttl=255 time=1.031 ms
Request timeout for icmp_seq 132
64 bytes from 192.168.1.102: icmp_seq=133 ttl=255 time=1.428 ms
64 bytes from 192.168.1.102: icmp_seq=134 ttl=255 time=0.999 ms
Request timeout for icmp_seq 135
64 bytes from 192.168.1.102: icmp_seq=136 ttl=255 time=1.086 ms
64 bytes from 192.168.1.102: icmp_seq=137 ttl=255 time=1.124 ms
Request timeout for icmp_seq 138
Request timeout for icmp_seq 139
64 bytes from 192.168.1.102: icmp_seq=140 ttl=255 time=0.980 ms
64 bytes from 192.168.1.102: icmp_seq=141 ttl=255 time=1.092 ms
64 bytes from 192.168.1.102: icmp_seq=142 ttl=255 time=1.365 ms
Request timeout for icmp_seq 143
Request timeout for icmp_seq 144
Request timeout for icmp_seq 145
Request timeout for icmp_seq 146
Request timeout for icmp_seq 147
Request timeout for icmp_seq 148
Request timeout for icmp_seq 149
Request timeout for icmp_seq 150
Request timeout for icmp_seq 151
64 bytes from 192.168.1.102: icmp_seq=152 ttl=255 time=1.805 ms
64 bytes from 192.168.1.102: icmp_seq=153 ttl=255 time=1.521 ms
64 bytes from 192.168.1.102: icmp_seq=154 ttl=255 time=1.076 ms
64 bytes from 192.168.1.102: icmp_seq=155 ttl=255 time=1.747 ms
64 bytes from 192.168.1.102: icmp_seq=156 ttl=255 time=1.725 ms

so we're only talking a couple seconds at a time. I've tried using different routers and I'm seeing the same on all of them.

Let me know if there is something else I can do to debug.

tube0013 commented 1 year ago

Have you tried the most recent esp32/esphome firmwares? they have some improvements as far a connection reliability.

https://github.com/tube0013/tube_gateways/tree/main/models/current/tubeszb-cc2652-poe-2022/firmware/esphome

chrisdrackett commented 1 year ago

nope, I haven't updated it since I got it as everything was working well. I'll give this a go

chrisdrackett commented 1 year ago

hm, now it seems worse? It starts, but I get a lot of the following in the logs:


21:36:14 | [D] | [esp-idf:000] | E (69413) esp.emac: no mem for receive buffer
-- | -- | -- | --
21:36:14 | [D] | [esp-idf:000] | E (69415) esp.emac: no mem for receive buffer
21:36:15 | [D] | [esp-idf:000] | E (69771) esp.emac: no mem for receive buffer
21:36:15 | [D] | [esp-idf:000] | E (69772) esp.emac: no mem for receive buffer
21:36:15 | [D] | [esp-idf:000] | E (69784) esp.emac: no mem for receive buffer
21:36:15 | [D] | [esp-idf:000] | E (69785) esp.emac: no mem for receive buffer
21:36:15 | [D] | [esp-idf:000] | E (69974) esp.emac: no mem for receive buffer
21:36:15 | [D] | [esp-idf:000] | E (69975) esp.emac: no mem for receive buffer

before I lose the connection entirely and it won't come back till I power cycle.

tube0013 commented 1 year ago

recommend trying to flash over serial so the esp32 gets a full erase to clear out any old vestiges from the old install

https://github.com/tube0013/tube_gateways/tree/main/models/current/tubeszb-cc2652-poe-2022#esp32-flashing

chrisdrackett commented 1 year ago

yeah, I did it over serial (OTA didn't seem documented, and I'm nervous enough as it is :P)

chrisdrackett commented 1 year ago

ok, I flashed it again just to be safe and it seems to be less noisy at least:

21:38:46 | [D] | [api:102] | Accepted ::FFFF:192.168.1.190
-- | -- | -- | --
21:38:46 | [D] | [api.connection:918] | Home Assistant 2023.1.1 (::FFFF:192.168.1.190): Connected successfully
21:39:02 | [D] | [streamserver:074] | New client connected from 192.168.1.190
21:39:02 | [D] | [binary_sensor:036] | 'TubesZB Serial Connected': Sending state ON
21:40:01 | [D] | [esp-idf:000] | E (85541) esp.emac: emac_esp32_receive(244): received buffer longer than expected
21:40:01 | [D] | [esp-idf:000] | E (85543) esp.emac: emac_esp32_receive(244): received buffer longer than expected
21:40:01 | [D] | [esp-idf:000] | E (85554) esp.emac: emac_esp32_receive(244): received buffer longer than expected

I'll keep an eye on it and see if I have any more disconnects

tube0013 commented 1 year ago

okay it's late here, so let me know. the esp-idf lines sort of surprise me. but I may just be tired. as I've been testing some idf stuff today too.

chrisdrackett commented 1 year ago

Sadly I think its actually worse after the update. In the logs I get something like the following after about 2 minutes of power:


07:41:24 | [D] | [esp-idf:000] | E (61813) esp.emac: emac_esp32_receive(244): received buffer longer than expected
-- | -- | -- | --
07:41:25 | [D] | [esp-idf:000] | E (62643) esp.emac: emac_esp32_receive(244): received buffer longer than expected
07:41:25 | [D] | [esp-idf:000] | E (62645) esp.emac: emac_esp32_receive(244): received buffer longer than expected
07:41:25 | [D] | [esp-idf:000] | E (62656) esp.emac: emac_esp32_receive(244): received buffer longer than expected
07:41:25 | [D] | [esp-idf:000] | E (62837) esp.emac: emac_esp32_receive(244): received buffer longer than expected
07:41:25 | [D] | [esp-idf:000] | E (63273) esp.emac: no mem for receive buffer
07:41:25 | [D] | [esp-idf:000] | E (63275) esp.emac: no mem for receive buffer
07:41:25 | [D] | [esp-idf:000] | E (63278) esp.emac: no mem for receive buffer
07:41:25 | [D] | [esp-idf:000] | E (63300) esp.emac: no mem for receive buffer

at which point I lose both the web interface and the connection from Z2M. A restart fixes it, but only for a couple of minutes.

tube0013 commented 1 year ago

does powering over usb (remove from case) and DO NOT use powered ethernet help it run longer?

I can get a replacement Olimex ESP32-PoE module out to you today. - I think with a new case too as the module is now longer than it was when you purchased.

chrisdrackett commented 1 year ago

so I turn off POE on this port and plug in USB?

tube0013 commented 1 year ago

yes. the usb and poe are not isolated from each other so be sure PoE is off.

chrisdrackett commented 1 year ago

I'm seeing the same thing powered by USB. At this point its happening immediately:


07:58:33 | [D] | [api:102] | Accepted ::FFFF:192.168.1.190
-- | -- | -- | --
07:58:33 | [D] | [api.connection:918] | Home Assistant 2023.1.1 (::FFFF:192.168.1.190): Connected successfully
07:58:48 | [D] | [esp-idf:000] | E (25728) esp.emac: emac_esp32_receive(244): received buffer longer than expected
07:58:48 | [D] | [esp-idf:000] | E (25730) esp.emac: emac_esp32_receive(244): received buffer longer than expected
07:58:48 | [D] | [esp-idf:000] | E (25741) esp.emac: emac_esp32_receive(244): received buffer longer than expected
07:58:50 | [D] | [esp-idf:000] | E (27638) esp.emac: no mem for receive buffer
07:58:50 | [D] | [esp-idf:000] | E (27642) esp.emac: no mem for receive buffer
07:58:50 | [D] | [esp-idf:000] | E (27647) esp.emac: no mem for receive buffer
07:58:50 | [D] | [esp-idf:000] | E (27663) esp.emac: no mem for receive buffer

tube0013 commented 1 year ago

I think there is something up with the hardware in the esp. I will get a new one out today. can you just confirm address is same as original order?

in the mean time if you have an USB to serial converter you can wire up the zigbee module to serial to keep you online. the pins are labeled on the underside of the PCB let me know if you need any clarification.

chrisdrackett commented 1 year ago

Yep, address is the same! I don't have a converter, but I do have my old coordinator somewhere still. I think in theory I need to change its ieee or something? I'll look it up…

tube0013 commented 1 year ago

z2m should be able to restore current network to it...just not sure on the conditions it uses. just make a copy of the backup in the zigbee2mqtt folder before doing anything ;)

chrisdrackett commented 1 year ago

man, now I'm just getting garbage when connected via esphome-flasher :\

Serial port closed!
[08:48:28][3m[e22]nting vitt f! Re-connecting...
[08:48:43]ethernet:122Cocting via ettnnectg...0]ht:1: a[3toetn..[m
Serial port closed!
[08:48:43]ethernet:122Cocting via ettnnectg...0]ht:1: a[3toetn..[m
[08:48:58]3WtConeca e;3[[ent12:Cnei i tentfie!R-ncig.0m[W[ehret12:Connecngvia etret failed! Re-con[0m
[08:49:13]ten:2] ti hennecting...3mrnt12] oncti hre aecting...
[08:49:28][0;3[W][ethee:2:Cneng viaetentfiled! Re-conn[03mW[ Re-con.
Serial port closed!
[08:49:28][0;3[W][ethee:2:Cneng viaetentfiled! Re-conn[03mW[ Re-con.

tube0013 commented 1 year ago

never seen that. it either works on not in the cases where they fail in my experience.

chrisdrackett commented 1 year ago

hm, FYI (and I'm not sure why I did this) but I updated the firmware on my CC2652P2 based Zigbee to Ethernet/USB Serial Coordinator before using it and I'm seeing the same no mem for receive buffer in my logs.

chrisdrackett commented 1 year ago

I just rolled back to tube-zb-gw-cc2652p2-v2_2021_09_17.bin and I'm not seeing the odd lines in my log. 🤞I don't see the other issues either!

chrisdrackett commented 1 year ago

ok, nevermind, just got E (40079) esp.emac: no mem for receive buffer again. Could something else be going on? Is this a HASS or Z2M issue after all?

tube0013 commented 1 year ago

very strange been usign the new FW for months now with never seen anything like this.

I compiled a new Dev FW right now, that is based on the -IDF framework can you give this a try on the PoE module? flash over serial. tubeszb-cc2652-poe-2022-idf-20230203.bin.zip

tube0013 commented 1 year ago

they are showing in the esphome log as I understand it, so definitely think it's focused there. do you have a different switch you could try - did anything change on the switch side recently?

chrisdrackett commented 1 year ago

I recently got an aggregation switch to reduce daisy chaining, but the coordinator has been plugged into the same switch and port the entire time I've owned it.

chrisdrackett commented 1 year ago

getting the same thing. I'm racking my brain but can't think of anything that has changed, especially in the last day or so when it went from dying every week to dying within hours/minutes/seconds.

chrisdrackett commented 1 year ago

ok, I was wrong, I do have Z2M set to auto-update (something I generally try to avoid). My current guess is something changed there, because everything else hasn't changed in months.

tube0013 commented 1 year ago

I would be hearing from A LOT of people if the z2m update caused issues. I'm leaning toward Network/Power or hardware. I'll try and do some more research on my side as well.

chrisdrackett commented 1 year ago

True, just trying to figure out what changed. I'm concerned because I see the same thing on two pieces of hardware, which makes me think its something on my end :(

tube0013 commented 1 year ago

there is a fix for this or a similar error that got merged into ESP-IDF in late November. the version used in esphome is currently before this. and I'm not really sure how the versioning in IDF tracks. but I'm attempting to build with the "Latest" to see what happens.

chrisdrackett commented 1 year ago

ok, I have the POE/USB coordinator plugged into my HASS box via USB at the moment to see what happens. I'm not sure how to see logs from the coordinator in this case (if it is even possible) but I'll keep an eye on Z2M to see how long it stays up

tube0013 commented 1 year ago

If zigbee running over USB the ESP32 plays no role.

do you still want me to ship out the new ESP32-PoE board? maybe worth having on hand to try.

chrisdrackett commented 1 year ago

If you think it would help debug. Because I've seen the same thing on 2 different coordinators I'm not convinced that new hardware will make a difference, but I also have 0 idea what is going on or how these things work :D

tube0013 commented 1 year ago

No problem.

so the ESP-IDF just had a bug fix release(4.4.4) earlier today in the version Esphome uses(4.4.2) and it's the first since that issue I saw had a merge so the tool chains aren't even all updated yet to support it to see if makes a difference. but iI'll keep an eye on it.

chrisdrackett commented 1 year ago

sounds good, I'm still super confused why I just started seeing this recently, so who knows!

chrisdrackett commented 1 year ago

@tube0013 got the new board, do I just move the zigbee module over? do I need to do anything else to make sure I don't break anything on my zigbee network?

tube0013 commented 1 year ago

Just move it over, you will need to update the ip in your config or change the dns reservation for the new board.

chrisdrackett commented 1 year ago

@tube0013 thanks, unfortunately this one is having the same issue. I can only assume this is something on my network / the issue you linked. I'm in no way a networking expert, but seems like something on my network is sending more data / faster data than the ESP can handle?

MattWestb commented 1 year ago

Trying fixing the network port in the switch to 100Mb full duplex.

tube0013 commented 1 year ago

Also what kind of switch are you using?

chrisdrackett commented 1 year ago

its a unifi switch (US-8-150W) forcing it to 100Mb full duplex does not change anything (that already how it connects via the default "auto" setting.

I assume that the coordinator is picking up on some sort of traffic on my network as I don't even currently have this device setup to connect to Z2M and its still crashing almost immediately.

tube0013 commented 1 year ago

okay this is just really odd. I do all my testing with unifi switches (and router). do you want to try a static IP build?

chrisdrackett commented 1 year ago

I'm happy to do anything to help debug! I'm currently trying to figure out how to setup port mirroring and wireshark, but I'll probably have little idea what I'm seeing :P

tube0013 commented 1 year ago

grab the yaml from here:

https://github.com/tube0013/tube_gateways/tree/main/models/current/tubeszb-cc2652-poe-2022/firmware/esphome

and drop it in esphome folder (if you don't have it let me know)

uncomment the static ip stuff and fill it out ( send me the ip and gateway ip if you don't have esphome).

chrisdrackett commented 1 year ago

ok, I just setup a network with multicast turned off and added the coordintator to that network. So far it's been up for minutes, so my current guess is that something apple is doing with multicast and the culprit (I don't have any other devices on my network that use multicast that I'm aware of)

tube0013 commented 1 year ago

I have multicast and IGMP enabled. as I have all my google homes on a separate vlan otherwise they have issues (at least that's my memory of why I did it).

chrisdrackett commented 1 year ago

still up without any log output on the device. I think we (kinda) found the issue. I'm reading up on multicast, but that isn't your issue ;)

do you want me to mail this module back?

tube0013 commented 1 year ago

yeah, did I include a return envelope/label?

tube0013 / tube_gateways

coordinator crashing? #124