syssi / esphome-zb-gw03

ESPHome custom firmware for the ZB-GW03 zigbee gateway
Apache License 2.0
72 stars 18 forks source link

Watchdog heartbeat timeout #8

Closed dartanidi closed 1 year ago

dartanidi commented 2 years ago

@syssi first of all thank you for making this Esphome firmware possible for the zb-gw03.

I moved from a Sonoff Zbbridge where I had sometime some connection stability problems looking for a more reliable solution. Unfortunately this seems totaly unstable, with bellows reporting a lot of watchdog heartbeat timeouts in the log.

It eventually stops working with HA reporting errors like this:

Error doing job: Exception in callback ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.9/site-packages/bellows/thread.py:97
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.9/site-packages/bellows/thread.py", line 98, in check_result_wrapper
    result = call()
  File "/usr/local/lib/python3.9/site-packages/bellows/ezsp/__init__.py", line 258, in frame_received
    self._protocol(data)
  File "/usr/local/lib/python3.9/site-packages/bellows/ezsp/protocol.py", line 138, in __call__
    frame_name = self.COMMANDS_BY_ID[frame_id][0]
KeyError: 520

which I read are tipical in the ZB to WiFi bridges. So it looks like a connection problem, but HA PC and the bridge are connected via the same switch, and the bridge is never reported to be unavailable.

Toggling the Zigbee Reset switch solves the issue, but this problems occurs say every hour or two. Any clue? Thanks for your help.

dartanidi commented 2 years ago

I forgot to say I'm using your example.yaml and Esphome v2021.12.1

syssi commented 2 years ago

I haven't seen this issue before. My setup is running 10 days++ without issues. I will try to provide some ideas to isolate the cause:

  1. Please replace the power supply of the gateway to rule out power issues.
  2. The connection between the bridge and HA core is wired and based on a TCP connection. Packet loss on stream/socket level cannot be an issue. If it's a connection issue (reset of the first connection + reconnect + dead-lock in some state) it should be visible on network level. Let's use wireshark/tcpdump to capture the traffic between the bridge and HA core (port 6638 should be sufficient). You can stop the capture as soon the issue occurs. Could you provide this capture?
  3. Keep in mind: It's a serial connection between the zigbee module and the ESP32. If the connection between the ESP32 and the EFR32MG21 is bad the bridge will deliver garbage too.
dartanidi commented 2 years ago

Thanks for replying syssi.

  1. I tried without luck to use a differen power supply, so it's not the case.
  2. I'm not able to sniff any packet (except MDNS and broadcast) from/to the bridge using wireshark; I have limited experience in it. Anyway looking at the ESPhome log I found that the streamingserver is reportin some client disconnections and further reconnections with the HA IP. I set the logger to very verbose to check if there is something more.

One question: have you tried your setup with the latest ESPhome version (like in my case)? Thanks again for your help.

syssi commented 2 years ago

I'm not able to sniff any packet (except MDNS and broadcast) from/to the bridge using wireshark; I have limited experience in it. Anyway looking at the ESPhome log I found that the streamingserver is reportin some client disconnections and further reconnections with the HA IP. I set the logger to very verbose to check if there is something more.

If your network is switched you can only see broadcasts if you use a third host to sniff traffic. You have to make sure to capture the traffic at the network card of your HA instance (using tcpdump -i eth0 -s0 -w traffic.pcap). But it looks like this isn't necessary anymore because you already know the root-cause: Connection issues.

I've updated my setup vom 2021.11.4 to esphome 2021.12.2 now. I will report tomorrow it's stable or not.

dartanidi commented 2 years ago

Update: I flashed the bridge with the custom tasmota firmware from thehelpfulidiot, and seems to be working flawlessly. If this is the case, looks like the stream-server component could have issues with the very recent versions of ESPhome.

I will keep you updated if the actual setup will continue working. Thanks anyway for your help, much appreciated!

syssi commented 2 years ago

You are right. I can see multiple reconnects using 2021.12.2:

[17:21:57][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:21:58][D][streamserver:106]: New client connected from 192.168.1.100
[17:25:37][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:25:37][D][streamserver:106]: New client connected from 192.168.1.100
[17:28:04][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:28:05][D][streamserver:106]: New client connected from 192.168.1.100
[17:35:39][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:35:40][D][streamserver:106]: New client connected from 192.168.1.100
[17:40:00][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:40:00][D][streamserver:106]: New client connected from 192.168.1.100
[17:40:12][D][streamserver:106]: New client connected from 192.168.1.100
[17:42:49][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:42:49][D][streamserver:106]: New client connected from 192.168.1.100
[17:46:59][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:46:59][D][streamserver:106]: New client connected from 192.168.1.100
[17:48:25][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:48:26][D][streamserver:106]: New client connected from 192.168.1.100
[17:50:12][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:50:12][D][streamserver:106]: New client connected from 192.168.1.100
[17:52:08][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:52:09][D][streamserver:106]: New client connected from 192.168.1.100
[17:53:35][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:53:35][D][streamserver:106]: New client connected from 192.168.1.100
[17:57:24][D][streamserver:055]: Client 192.168.1.100 disconnected
[17:57:25][D][streamserver:106]: New client connected from 192.168.1.100
[18:06:11][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:06:11][D][streamserver:106]: New client connected from 192.168.1.100
[18:12:24][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:12:24][D][streamserver:106]: New client connected from 192.168.1.100
[18:22:33][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:22:33][D][streamserver:106]: New client connected from 192.168.1.100
[18:27:34][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:27:35][D][streamserver:106]: New client connected from 192.168.1.100
[18:29:21][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:29:21][D][streamserver:106]: New client connected from 192.168.1.100

I've flashed the device using esphome 2021.11.4 again to verify this version is better for real.

syssi commented 2 years ago

The log message at the esphome node are the same on 2021.11.4:

[18:45:43][D][streamserver:106]: New client connected from 192.168.1.100
[18:53:17][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:53:17][D][streamserver:106]: New client connected from 192.168.1.100
[18:57:37][D][streamserver:055]: Client 192.168.1.100 disconnected
[18:57:37][D][streamserver:106]: New client connected from 192.168.1.100
[19:23:18][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:23:19][D][streamserver:106]: New client connected from 192.168.1.100
[19:27:18][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:27:19][D][streamserver:106]: New client connected from 192.168.1.100
[19:32:50][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:32:51][D][streamserver:106]: New client connected from 192.168.1.100
[19:36:09][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:36:09][D][streamserver:106]: New client connected from 192.168.1.100
[19:52:46][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:52:47][D][streamserver:106]: New client connected from 192.168.1.100
[19:55:55][D][streamserver:055]: Client 192.168.1.100 disconnected
[19:55:55][D][streamserver:106]: New client connected from 192.168.1.100
[20:04:42][D][streamserver:055]: Client 192.168.1.100 disconnected
[20:04:42][D][streamserver:106]: New client connected from 192.168.1.100
[20:14:50][D][streamserver:055]: Client 192.168.1.100 disconnected
[20:14:50][D][streamserver:106]: New client connected from 192.168.1.100
dartanidi commented 2 years ago

Could it be possible that the ESPhome directories need to be cleaned before downgrading the files? Otherwise it doesn't have much sense since previously it was working

syssi commented 2 years ago

May be my HA core instance is closing the connection intentionally. I will try to monitor the HA core log as next step.

syssi commented 2 years ago

At the home-assistant.log I can see the warnings now:

2021-12-31 10:26:01 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:26:16 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:26:31 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:26:46 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 10:27:01 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout:

I will capture some traffic to get a better understanding who drops the connection.

syssi commented 2 years ago

@dartanidi Do you use ZHA or Z2M now?

dartanidi commented 2 years ago

I'm using ZHA, btw with the tasmota custom firmware it's working perfectly, no errors are reported in the log.

I still suspect there is an issue within the stream-server component, I'm reading the issues in oxan's repository and he said the component should have to be rewritten using the new socket abstraction instead of async_tcp

syssi commented 2 years ago

I assume you are on the right track. Let's see! :-)

syssi commented 2 years ago
2021-12-31 11:57:22 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:57:37 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:57:52 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:58:07 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:58:22 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2021-12-31 11:58:24 WARNING (bellows.thread_0) [bellows.uart] Reset future is None
2021-12-31 11:58:24 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /srv/homeassistant/lib/python3.8/site-packages/bellows/thread.py:97
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.8/site-packages/bellows/thread.py", line 98, in check_result_wrapper
    result = call()
  File "/srv/homeassistant/lib/python3.8/site-packages/bellows/ezsp/__init__.py", line 258, in frame_received
    self._protocol(data)
  File "/srv/homeassistant/lib/python3.8/site-packages/bellows/ezsp/protocol.py", line 138, in __call__
    frame_name = self.COMMANDS_BY_ID[frame_id][0]
KeyError: 520
dartanidi commented 2 years ago

Exactly what was happening to me, see first post

syssi commented 2 years ago

In short: I'm able to reproduce the issue now. ;-)

I captured some traffic. It looks like the connection is terminated by the watchdog because of no(?) serial/ezsp traffic.

https://github.com/zigpy/bellows/blob/dev/bellows/zigbee/application.py#L44-L50 https://github.com/zigpy/bellows/blob/dev/bellows/zigbee/application.py#L696-L752

syssi commented 2 years ago

Quick update: It looks like if you downgrade esphome to 2021.9.3 the old stream server implementation is used. This implementation didn't throw any errors in about 6 hours.

alelucc commented 2 years ago

Hi, i'm having this same issue but i am unable to easily rollback to 2021.9.3. Is there any new solution to this issue? I was running 2021.12.3 and i was able to recover the 2021.11.1 backup and same issue.

Thanks

syssi commented 2 years ago

It's pretty easy to install an old esphome version if you have access to a linux host. Do you need some instructions?

alelucc commented 2 years ago

If you are so kind to provide to me it wold be great!! Thanks!!!! Btw i can imagine that i have to install the 2021.9.3 version on my linux distro and "update" the coordinator with that version, right?

syssi commented 2 years ago

Yes. Just pick a linux machine, make sure the python package manager pip3 is available. If not please install the package manager. On debian/ubuntu it looks like this: sudo apt install python3-pip. As soon the python package manager is available you are able to install any esphome version by:

pip3 install esphome==2021.9.3

It's possible the path to the esphome CLI command isn't part of your $PATH. Just call the command with the full path:

~/.local/bin/esphome version

This should return 2021.9.3. Now you are able to install/flash your gateway:

~/.local/bin/esphome run your-configuration.yaml

You could upgrade the esphome install later on to the most recent version:

pip3 install esphome -U

But keep in mind to downgrade the version again if you want to update/improve your gateway configuration.

alelucc commented 2 years ago

Thank you so much!! It worked!! Now i have to remember to not "Update All" on the esphome dashboard :D Now it's not necessary to reflash the zigbee module, rigth?

syssi commented 2 years ago

Did you flash your ZigBee module ever? It's independent from the ESPHome version. The ZigBee module is equipped with the last flashed firmware.

alelucc commented 2 years ago

Did you flash your ZigBee module ever? It's independent from the ESPHome version. The ZigBee module is equipped with the last flashed firmware.

Yes, i followed the entire guide and the coordinator was working but poorly with EspHome 2021.12.3.

syssi commented 2 years ago

Could you create a new issue first? We should avoid off-topic questions here.

syssi commented 2 years ago

It looks like the issue is gone. This user don't have trouble with a recent esphome version: https://community.home-assistant.io/t/zb-gw03-ewelink-ethernet-zigbee-gateway-now-hacked-with-tasmota-and-esphome-so-can-be-used-via-mqtt-or-as-a-remote-zigbee-lan-adapter-with-home-assistants-zha-integration-or-zigbee2mqtt/341223/59

dartanidi commented 1 year ago

@syssi oxan has updated the stream server component, now your ESPhome firmware works as expected, much better than the Tasmota one. Now your code can be used as is without any issue. Thanks!

syssi commented 1 year ago

Cool! I will update the project as soon as possible. I would be happy about a pull request too!

dartanidi commented 1 year ago

Cool! I will update the project as soon as possible. I would be happy about a pull request too!

There is no no need, your yaml files are working out of the box since the component works exactly as the previous version. I'm testing everything using the latest ESPhome version, and no issues since more than one day