mercenaruss / uzg-firmware

Firmware for ZigStar UZG-01
GNU General Public License v3.0

UZG-01 drops connections while renewing DHCP lease #10

Status: Closed (ashishpandey closed 6 months ago)

ashishpandey commented 9 months ago

It seems the UZG-01 drops client connections while renewing DHCP leases.

This disrupts clients like zigbee2mqtt. See the related issue, where some people have reproduced this reliably: https://github.com/Koenkk/zigbee2mqtt/issues/20148

Switching the UZG-01 to a static IP solves the issue, which further supports the diagnosis.

ffries commented 8 months ago

Fortunately, I cannot confirm this bug.

I am using Z2M at two locations, in a pure Docker/managed mode, with two UZG01 devices on Ethernet (not Wi-Fi) and USB power (no PoE).

I am using DHCP with a long renewal time (4h30min). Everything is running on electrical backup.

The UZG01 device uptime is 3 days / 5 days (I upgraded firmware 3 days ago) and before I had uptimes of a month. Z2M uptime is one day because I upgraded yesterday.

Screenshot from 2024-01-02 22-34-32

Screenshot from 2024-01-02 22-42-46

ashishpandey commented 8 months ago

I think UZG uptime is not the issue here. My UZG stays up, but the connection between it and zigbee2mqtt drops for me at DHCP lease renewal.

I see the Socket Uptime being capped at 1d in the screenshots above. Where is that limit coming from?

ffries commented 8 months ago

Hello ashishpandey, sorry for my late reply. Socket time is 1 day because I upgraded Z2M on both machines.

Docker shows a Z2M uptime of several days.

Screenshot from 2024-01-06 10-04-02

I agree with you that I should look more carefully at the logs, so I now forward everything to a Syslog server. What should I look for? Normally, it should display "error" or something similar.

ffries commented 8 months ago

Here is my error log from the last 48 hours from Syslog: it only shows that some devices cannot be pinged. This is because they are in a remote location and I did not install a secondary UZG01 there. But I don't see any sign of UZG01 failure.

Please note that the UZG01 does not run on PoE (yet), and I am using the Zigbee firmware that came with the device. Versions are shown in my previous answer.

Screenshot from 2024-01-06 11-36-51

ashishpandey commented 8 months ago

@ffries, how do you set up the UZG to log to a syslog server? I can go back to DHCP mode and try to capture logs, but I don't know how to configure syslog.

mercenaruss commented 8 months ago

@xyzroe can you have a look at dhcp?

xyzroe commented 8 months ago

Quoting mercenaruss: "@xyzroe can you have a look at dhcp?"

What do you want me to look at? Hundreds of devices are working all over the world, and there have never been problems with DHCP.

xyzroe commented 8 months ago

I think that @ashishpandey has some problems with his network. These problems make the UZG reconnect to the network, but not successfully every time.

ashishpandey commented 8 months ago

@xyzroe indeed, there is some problem with either the network or the UZG; I am just trying to identify what it is. So far I am not assuming it is a problem with the UZG firmware, the hardware, or something external, but the problem certainly exists.

If you look at the linked issue from zigbee2mqtt, multiple people have reported this and narrowed it down to being worked around by switching to a static IP on the UZG. So we must all have something in common going on. It is not isolated to my own network.

A problem does not exist until identified / acknowledged / investigated. If it turns out to be something external to the UZG, we will all learn something to mitigate it. If it turns out to be the UZG, the product can improve. Both are good for end users of the UZG. How can we investigate and get to the root cause?

If it helps, I am running pfSense as the DHCP server. I also have hundreds of devices on the network and don't see anything generally unsatisfactory with it. But I am happy to investigate why my UZG becomes unavailable to zigbee2mqtt at DHCP lease renewal time. Some of the other posters in the zigbee2mqtt issue also use pfSense; it is a very widely used product itself.

xyzroe commented 8 months ago

I've been using DHCP on Zigstar since release. I don't have any such problems.

But what's even more surprising is that if I simulate problems with the network (for example, by disconnecting the switch that the ZigStar is connected to), zigbee2mqtt tries to reconnect and succeeds. The same thing happens if I reboot the ZigStar while it is working.

ashishpandey commented 8 months ago

This seems to be in the territory of a "works on my machine" type of problem. What can we do to help investigate what we are seeing? Are there any logs? @ffries mentioned syslog, and I am curious whether I can look at that.

The zigbee2mqtt reconnect behaviour you mention is also interesting. What some of us have been seeing is that zigbee2mqtt quits when the adapter disconnects (the linked issue is essentially that). I am on zigbee2mqtt 1.35.0 (the issue was reported at 1.34.0-1). The same happens if I restart the UZG.

I have switched the UZG back to DHCP for now, to capture more of the restarts at the zigbee2mqtt end, where I can see some logs. It is reproducible every 2 hours for me (or whatever I set the DHCP lease time to). Unfortunately, I only see "Adapter disconnected, stopping" in the zigbee2mqtt logs.
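One way to check the "every lease period" observation is to take the timestamps of the "Adapter disconnected" lines from the zigbee2mqtt log and compare the gaps between them with the lease time. Below is a minimal Python sketch of that comparison. It assumes the timestamps have already been parsed into `datetime` objects; the function name `gaps_match_period`, the 5-minute tolerance, and the sample timestamps are illustrative assumptions, not anything provided by zigbee2mqtt or the UZG firmware.

```python
from datetime import datetime, timedelta


def gaps_match_period(disconnects, period, tolerance=timedelta(minutes=5)):
    """Return True if every gap between consecutive disconnects is close to
    a whole multiple of `period`.

    A multiple (rather than exactly one period) is accepted so that a
    renewal which did not cause a drop does not break the pattern.
    """
    if len(disconnects) < 2:
        return False  # not enough data to say anything
    ordered = sorted(disconnects)
    period_s = period.total_seconds()
    for earlier, later in zip(ordered, ordered[1:]):
        gap_s = (later - earlier).total_seconds()
        multiples = round(gap_s / period_s)
        if multiples < 1:
            return False  # drops much closer together than one period
        if abs(gap_s - multiples * period_s) > tolerance.total_seconds():
            return False
    return True


# Hypothetical timestamps, for illustration only.
sample = [
    datetime(2024, 1, 10, 8, 0, 12),
    datetime(2024, 1, 10, 10, 0, 9),
    datetime(2024, 1, 10, 12, 0, 15),
]
print(gaps_match_period(sample, timedelta(hours=2)))
```

DHCP clients commonly attempt renewal at half the lease time, so it may be worth running the check against both the full lease time and half of it.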

xyzroe commented 8 months ago
  1. The system log he mentioned is just z2m; take a closer look at @ffries' screenshot.

  2. If you run z2m as an add-on in HA, this is natural behavior. If you are using standalone z2m, it is your job to take care of restarting it if it crashes.

  3. After receiving a new DHCP lease, the esp32 restarts the network interface. This is typical behavior for the libraries used in the project. Naturally, after the network restarts, all connections need to be re-established. In my case, as for most users, this is done by the add-on mechanism in HA, which simply restarts z2m if it has stopped. So you can use a static address, or change the lifetime of the DHCP lease to 10+ years, but the most correct way is to ensure that z2m is restarted in case of a crash. Because a socket connection to the adapter is used, breaks will happen sooner or later. In your case, this will stop the entire Zigbee network.
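The "restart z2m if it has stopped" approach from point 3 can be sketched in a few lines of Python. This is only an illustration of the idea, not how the HA add-on does it; in practice a Docker restart policy or a systemd `Restart=` setting is the usual way to get the same effect. The function name `supervise` and its parameters are assumptions made for this sketch, and the real command depends on how z2m is installed.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay_s=1.0):
    """Run `cmd`, restarting it each time it exits with a non-zero status.

    Returns the number of times the command was started. Stops after a
    clean exit, or once `max_restarts` restarts have been used up.
    """
    starts = 0
    while True:
        starts += 1
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return starts  # clean exit, nothing to restart
        if starts > max_restarts:
            return starts  # give up instead of looping forever
        time.sleep(delay_s)  # short pause before the next attempt


if __name__ == "__main__":
    # Stand-in for z2m: a child process that always exits with an error.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_restarts=2, delay_s=0))
```

Each restart still leaves a window during which the Zigbee network is not being served, so this limits the damage of a dropped socket rather than preventing the drop.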

docstalek commented 8 months ago

If the DHCP renewal results in a new IP, then it makes sense to restart the network interface. If, however, the IP stays the same (which is true in all the reported cases, with a permanent/static reservation on the DHCP server), then the network interface should not restart. All my network-based equipment keeps its network up and running even after countless DHCP renewals. The UZG is the only one that drops the connections.
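The rule proposed above can be stated as a small check. This is a sketch of the decision logic only, not code from the UZG firmware or the ESP32 libraries it uses; `Lease` and `needs_interface_restart` are hypothetical names, and the addresses are documentation examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    """Addressing details handed out by the DHCP server (hypothetical type)."""
    ip: str
    netmask: str
    gateway: str


def needs_interface_restart(current: Optional[Lease], renewed: Lease) -> bool:
    """Restart only when the renewal actually changes the addressing."""
    if current is None:
        return True  # first lease: the interface has to be configured
    return current != renewed  # same addressing: keep existing sockets


old = Lease("192.0.2.10", "255.255.255.0", "192.0.2.1")
same = Lease("192.0.2.10", "255.255.255.0", "192.0.2.1")
moved = Lease("192.0.2.42", "255.255.255.0", "192.0.2.1")

print(needs_interface_restart(old, same))   # renewal with a static reservation
print(needs_interface_restart(old, moved))  # renewal that hands out a new IP
```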

Automatic restart of z2m in event of a crash is something all should have configured (either with the HA addon or some other way). However, a restart takes time and during that period the zigbee network will not function properly. In my opinion, this is not an acceptable solution to this issue.

Disabling DHCP in the UZG has turned out to be the only stable solution for me, but I still feel that DHCP should work better, i.e. renewals should not restart the network interface.

xyzroe commented 8 months ago

I just ran some tests on my UZG-01. What I found: the socket connection doesn't drop during DHCP renewal. The first two screenshots were taken just after the DHCP update. The second two were taken just after z2m started.

Screenshot_2024-01-17-14-21-00-703_com mikrotik android tikapp

Screenshot_2024-01-17-14-21-06-520_com android chrome

Screenshot_2024-01-17-13-58-21-839_com android chrome

Screenshot_2024-01-17-13-58-01-604_com mikrotik android tikapp

I was wrong about how ZigStar handles DHCP renewal; it doesn't drop the connection. So I don't know how to reproduce your behavior. I'm using a Mikrotik as the DHCP server.

mercenaruss commented 8 months ago

All users reporting this are using pfSense.

docstalek commented 8 months ago

I will switch to a DHCP server outside of pfSense and test with that. Will report back my findings.

10Thirty commented 8 months ago

For what it is worth, I have been running a UZG-01 with a static assignment via pfSense and a DHCP lease time of 2 hours for a few weeks now without any drops.

Which version of pfSense and which DHCP daemon are you running? pfSense+ 23.09 added Kea DHCP as an opt-in feature preview, which is what I am running on a Netgate SG-3100.

Oliviakrkk commented 8 months ago

Hi, I have a Mikrotik as the DHCP server. I had a static IP configured via DHCP and experienced the stability issues. Now I have configured the static IP on my UZG, and it looks like most of my problems are gone.

I had a lot of problems with motion sensors and light automation. Lights wouldn't turn on with motion and turned on by themselves at night (super anti-sleep therapy ...)

So basically Mikrotik is also causing the problem.

xyzroe commented 8 months ago

Are you saying that changing the IP address settings affected the stability and speed of the entire Zigbee network? I think you're wrong. Something else changed while you were changing your IP settings.

fliespl commented 8 months ago

I am having the same issue with the UZG-01 even with a static IP setup. Today it broke twice.

On my end I am using a Mikrotik and a Netgear PoE switch.

Any ideas how to enable remote syslog to see what's happening and what ends up killing z2m?

xyzroe commented 8 months ago

What Zigbee chip do you have? P7 has another problem with the same behavior.

fliespl commented 8 months ago

@xyzroe I believe it's P7 since my device was ordered 2 weeks ago.

mercenaruss commented 7 months ago

@fliespl Please post the full list of the Zigbee devices in your network. It seems to be an issue with the P7 firmware, which will be resolved soon by Koenkk.

fliespl commented 7 months ago

@mercenaruss will that help, or do you need something else?

Also... I updated the UZG-01 to version 0.2.0 two days ago and haven't had to restart it yet. Will let you know if it happens again.

Total 54
By device type
End devices: 35
Router: 19
By power source
Battery: 35
Mains (single phase): 18
DC Source: 1
By vendor
LUMI: 11
IKEA of Sweden: 10
Danfoss: 4
_TZ3000_dowj6gyi: 3
_TZ3000_gvn91tmx: 3
HEIMAN: 2
_TZE200_81isopgh: 2
_TZ3000_mrpevh8p: 2
_TZE200_znbl8dj5: 1
_TZ3000_xabckq1v: 1
_TZ3000_gjnozsaz: 1
_TZ3000_mg4dy6z6: 1
_TZE204_t1blo2bj: 1
_TZE204_ztc6ggyl: 1
_TZ3000_ja5osu5g: 1
_TZE200_ga1maeof: 1
_TZE204_k7mfgaen: 1
_TZE204_sooucan5: 1
_TZE200_hl0ss9oa: 1
_TZE204_sbyx0lm6: 1
_TZ3000_fa9mlvja: 1
Danfoss: 1
_TZE200_9yapgbuv: 1
_TZ3000_saiqcn0y: 1
_TZ3000_bguser20: 1
By model
TS0601: 10
TS0201: 5
TS011F: 4
lumi.magnet.acn001: 4
eTRV0103: 4
TRADFRIbulbGU10WS345lm: 3
TS0041: 3
TRADFRIbulbE27WSglobeclear806lm: 2
lumi.sensor_wleak.aq1: 2
TS004F: 2
lumi.sensor_magnet.aq2: 2
SmokeSensor-EM: 2
Remote Control N2: 1
TRADFRI Driver 30W: 1
lumi.plug.maeu01: 1
TRADFRI motion sensor: 1
lumi.sensor_cube.aqgl01: 1
TS0202: 1
TRADFRI bulb E27 CWS 806lm: 1
lumi.magnet.ac01: 1
STARKVIND Air purifier: 1
TS0225: 1
TRV003: 1

docstalek commented 7 months ago

I have switched to another DHCP server, and the problem seems to be fixed. There might be a problem with ISC DHCP (which was shipped with pfSense earlier). ISC DHCP has reached end-of-life and has been replaced with Kea DHCP in pfSense.

Migrating to Kea DHCP is quite easy in pfSense (the push of a button), so I recommend doing that if you experience problems with DHCP.

fliespl commented 6 months ago

@mercenaruss do you know if the P7 problem was resolved? I had a connection break about 3 times this week (with a static IP).