technyon / nuki_hub

Use an ESP32 as a Hub between a NUKI Lock and your smarthome.
MIT License
526 stars 40 forks source link

"Client nukihub already connected, closing old connection" with v7 #87

Closed mahikeulbody closed 1 year ago

mahikeulbody commented 1 year ago

Hi,

I upgraded my nuki hub with the last firmware. Now I have many :

Client nukihub already connected, closing old connection New client connected from 192.168.0.227:58397 as nukihub"

With the previous version (v6.11) I had some reconnections because of malformed mqtt packet. It seems I have not anymore these malformed packet, only these new messages.

technyon commented 1 year ago

Which MQTT Broker are you using? Also, can you define "many" ... is it several per minute, or every few minutes or ... ?

mahikeulbody commented 1 year ago

Mosquito (hass addon) I got these messages 5 times in the last hour.

mahikeulbody commented 1 year ago

These messages stopped this night. Now I have this kind of messages (I replaced my user by xxx)

2023-01-21 06:57:05: Client nukihub disconnected: Success. 2023-01-21 06:57:05: New connection from 192.168.0.227:58182 on port 1883. 2023-01-21 06:57:05: New client connected from 192.168.0.227:58182 as nukihub (p2, c1, k60, u'xxx')

2023-01-21 07:01:16: Client nukihub disconnected due to protocol error. 2023-01-21 07:01:16: New connection from 192.168.0.227:58183 on port 1883. 2023-01-21 07:01:16: New client connected from 192.168.0.227:58183 as nukihub (p2, c1, k60, u'xxx').

2023-01-21 07:09:55: Client nukihub disconnected. 2023-01-21 07:09:55: New connection from 192.168.0.227:58184 on port 1883. 2023-01-21 07:09:55: New client connected from 192.168.0.227:58184 as nukihub (p2, c1, k60, u'xxx').

2023-01-21 07:20:59: Client nukihub disconnected due to protocol error. 2023-01-21 07:20:59: New connection from 192.168.0.227:58185 on port 1883. 2023-01-21 07:20:59: New client connected from 192.168.0.227:58185 as nukihub (p2, c1, k60, u'xxx').

I didn't restart nuki hub by myself.

Edit : finally not, sorry 2023-01-21 08:17:56: New connection from 192.168.0.227:53433 on port 1883. 2023-01-21 08:17:56: Client nukihub already connected, closing old connection. 2023-01-21 08:17:56: New client connected from 192.168.0.227:53433 as nukihub (p2, c1, k60, u'xxx')

amastrogiacomo1968 commented 1 year ago

I too have the same messages (I replaced my user by xxx): 2023-01-21 05:45:15: Client nukihub disconnected due to malformed packet. 2023-01-21 05:45:15: New connection from 192.168.1.84:51856 on port 1883. 2023-01-21 05:45:15: New client connected from 192.168.1.84:51856 as nukihub (p2, c1, k15, u'xxx').

mahikeulbody commented 1 year ago

They are not the same messages, yours are about "malformed packets". I have no more "malformed packets" since I am using nuki hub v7. Are you using v7 ?

mundschenk-at commented 1 year ago

For me, the malformed packets are gone, but there seems to be an issue with reconnects. Besides just reconnecting often, I now also see these errors every few hours (looks like rate limiting by the broker):

2023-01-21 07:15:35: Bad client <nukihub> sending multiple CONNECT messages.
2023-01-21 07:15:35: Client <nukihub> disconnected due to protocol error.
mahikeulbody commented 1 year ago

Me too

2023-01-21 10:42:12: Bad client nukihub sending multiple CONNECT messages.
2023-01-21 10:42:12: Client nukihub disconnected due to protocol error.
amastrogiacomo1968 commented 1 year ago

They are not the same messages, yours are about "malformed packets". I have no more "malformed packets" since I am using nuki hub v7. Are you using v7 ?

I'm stuck at 6.11, since I'm leaving for a week's vacation, I don't feel like updating and leaving.

mundschenk-at commented 1 year ago

I'm stuck at 6.11, since I'm leaving for a week's vacation, I don't feel like updating and leaving.

That's fine, but then your log is not relevant to this issue.

alexdelprete commented 1 year ago

I'm stuck at 6.11, since I'm leaving for a week's vacation, I don't feel like updating and leaving.

You should upgrade, v7 is much better because it solves the malformed topics issue that causes reboots. Upgrade asap.

mundschenk-at commented 1 year ago

I am logging the nuki/maintenance/log topic and I see several apparent reboots, but also lots of MQTT reconnects that are not related to reboots during the night (i.e. several MQTT connected. messages with nothing else in between, spaced over an hour maybe).

Mincka commented 1 year ago

I am logging the nuki/maintenance/log topic and I see several apparent reboots, but also lots of MQTT reconnects that are not related to reboots during the night (i.e. several MQTT connected. messages with nothing else in between, spaced over an hour maybe).

You can log your uptime, it's easiest to track the duration between reboots.

mqtt:
  sensor:
    - name: "Nuki Entrée Uptime"
      state_topic: "nuki_entree/maintenance/uptime"
      device_class: duration
      unit_of_measurement: min 

image

I have a lot of reboots without restart messages so I don't know the root cause. Scheduled restart is disabled on both ESP. I will see if not sending RSSI + QoS=1 + CS=False from https://github.com/technyon/nuki_hub/issues/51#issuecomment-1400664596 helps.

mundschenk-at commented 1 year ago

Thanks, yeah, the reboot messages might also simply be lost if the MQTT connection is not reestablished in a timely manner.

alexdelprete commented 1 year ago

I'm using the 7.0-cs-false fw, I don't have reboots, luckily:

image

image

This is my config:

image

image

Mincka commented 1 year ago

I'm using the 7.0-cs-false fw, I don't have reboots, luckily:

Argh, I have this behavior with 7.0-cs-false. :-/ I will test 7.1 with RSSI disabled before testing anything related to CS / QoS.

alexdelprete commented 1 year ago

Strange that we observe totally different results. I'm starting to question the broker...you both use Mosquitto right?

mundschenk-at commented 1 year ago

Actually, I don't any more. But I've no adapted my restart settings to yours, so we can compare tomorrow.

alexdelprete commented 1 year ago

Actually, I don't any more. But I've no adapted my restart settings to yours, so we can compare tomorrow.

I was replying to Julien actually. Yes I know you're using EMQX now, welcome onboard. :)

Did you see reconnection improvements since you switched to EMQX?

mundschenk-at commented 1 year ago

Ah, I thought you meant as both (because, well "both"). No, the reconnect behavior has not improved. Here's the most current log. The first MQTT connected at 19:13 is after the last reboot (me changing the settings). Everything else MQTT reconnects where the ESP did not reboot (and some buffer corruptions).

2023-01-23 19:13:27.652456+01:00: MQTT connected
2023-01-23 19:13:58.478196+01:00: 1nuki/lock/r
2023-01-23 19:13:58.673789+01:00: MQTT connected
2023-01-23 19:22:58.017850+01:00: MQTT connected
2023-01-23 19:31:03.502848+01:00: MQTT connected
2023-01-23 19:31:33.868528+01:00: 1nuki/lock/r
2023-01-23 19:31:34.019050+01:00: MQTT connected
2023-01-23 19:32:04.152625+01:00: -59T connected
2023-01-23 19:32:04.332708+01:00: MQTT connected
2023-01-23 19:32:34.490780+01:00: 1nuki/maner/
2023-01-23 19:32:34.761026+01:00: MQTT connected
2023-01-23 19:34:26.566261+01:00: MQTT connected
2023-01-23 19:35:12.852033+01:00: MQTT connected
2023-01-23 19:41:37.360543+01:00: MQTT connected
2023-01-23 19:41:39.408436+01:00: MQTT connected
2023-01-23 19:46:57.570165+01:00: MQTT connected
alexdelprete commented 1 year ago

Is the wifi connection reliable? What Wifi RSSI values do you have on average? That log comes from the serial interface?

It's so strange I don't see these disconnections...

mundschenk-at commented 1 year ago

Should be. There were no reconnects to the AP. RSSI in that timeframe was about -50 dB. There's a gap between 19:31:01 and 19:32:36 though where no RSSI values were recorded.

Edit: The log is from MQTT logging.

Mincka commented 1 year ago

Just onboarded on the EMQX wagon (thanks!) and using 7.1. I will quickly see if there's any improvement regarding the uptime. Migration was quite easy. The only thing I missed was setting the "superuser" role for the user connecting using MQTT Explorer but reading the logs helped. :)

mundschenk-at commented 1 year ago

Mhm, I just noticed lots of lines like these in the broker's log:

2023-01-23T21:09:24.351408+01:00 [warning] clientid: <nukihub>, line: 446, mfa: emqx_channel:handle_in/2, msg: puback_packetId_not_found, packetId: 2, peername: <ip>:<port>
Mincka commented 1 year ago

Which version do you use currently? No message like this in my EMQX logs with 7.1. puback is related to QoS=1, doesn't it?

mundschenk-at commented 1 year ago

Yes, that's the new QoS-1 version published in https://github.com/technyon/nuki_hub/issues/51#issuecomment-1400664596

Mincka commented 1 year ago

Ok. I'll wait a little bit to see how things are going with 7.1 + EMQX and I'll report back with my dedicated Nuki Debug view. 😁 Settings aligned with @alexdelprete and using 60000 ms as RSSI update interval.

image

alexdelprete commented 1 year ago

Mhm, I just noticed lots of lines like these in the broker's log:

2023-01-23T21:09:24.351408+01:00 [warning] clientid: <nukihub>, line: 446, mfa: emqx_channel:handle_in/2, msg: puback_packetId_not_found, packetId: 2, peername: <ip>:<port>

seems like it's not sending ACKs with QoS=1. Better let @technyon know about this. I hoped the library managed this internally with the different levels of QoS.

alexdelprete commented 1 year ago

Settings aligned with @alexdelprete and using 60000 ms as RSSI update interval.

I thought you said 2s is not too fast, and now you switched to 60s? :)

Mincka commented 1 year ago

Settings aligned with @alexdelprete and using 60000 ms as RSSI update interval.

I thought you said 2s is not too fast, and now you switched to 60s? :)

I'm so desperate about finding what's causing reboots I need to test anything. I promise I'll send you a coffee if that was the root cause. :D

In the meantime, I hope the QoS issues are not related to https://github.com/arduino-libraries/ArduinoMqttClient/issues/39. It looks like quite convoluted to implement.

alexdelprete commented 1 year ago

I promise I'll send you a coffee if that was the root cause.

If you don't mind, I prefer to offer you my italian coffee instead. :)

But I didn't say that's the root cause, I was saying that I don't believe in coincidences, and the RSSI topic, at least in my case, was typically the malformed one.

I know the root cause: not very good mqtt libraries. ;)

Maybe we should start looking at another library...:(

Mincka commented 1 year ago

Ma dai! Already one reboot after 27 minutes.

alexdelprete commented 1 year ago

Ma dai! Already one reboot after 27 minutes.

This is crazy, look at mine:

image

And I updated the fw, that's because it rebooted...

So we can exclude:

Mincka commented 1 year ago

Lucky you. I am the only one to have this kind of issue so the only thing we don't have in common is the Wi-Fi AP.

image

Nothing fancy here. I try to fully disable the RSSI updates, just to be 100% sure. xD

alexdelprete commented 1 year ago

What is it? Netgear AP?

Is there a device specific page on the AP where you can see some detail about the ESP32 device? Do you see disconnections on Wifi?

Mincka commented 1 year ago

I'm only starting to look at this but I can already see strange things. image

I'm starting troubleshooting with disabling of "Band Steering / 802.11 k/v". It's a good candidate for this kind of trouble.

alexdelprete commented 1 year ago

always remember: "it's always DNS" - corollaries: "it's always DHCP" / "it's always WiFi" :)

Now check the timestamps to find the correlation.

mundschenk-at commented 1 year ago

Interestingly, I've seen the device drop out from the network during the night, wenn it's DHCP lease time was up. Apparently it did not try to reconnect/reboot, despite WiFi being stable for other devices. Network disconnect timeout was set to 60 seconds, but that apparently didn't help (so the controller probably crashed?)

alexdelprete commented 1 year ago

That would mean the IP stack is bugged. I don't think that's the case. I'm starting to think about this paragraph written in Nuki Hub readme:

image

It has something to do with some APs. If you can, on the AP, create a WLAN with conservative (legacy) parameters, just to see if it improves the connection. Might be some tricky configuration that doesn't work with ESP32.

Personally I use a mesh based on the Fritz!Box and 4 repeaters. I have 30 Tasmota devices (ESP826x and ESP32) and never suffered from this.

mundschenk-at commented 1 year ago

I really don't think this is it. The WiFi already is my dedicated IoT VLAN with pretty conservative WiFi settings (but not totally, so no 802.11b 😅). The freezing thing happen's very sporadically (like every few weeks), so I think some sort of power fluctuation from the dodgy electrical-plug-with-USB is more likely.

Mhm, maybe I should try a different power supply for a while. (Current uptime is 9 hours and counting, so it's not happening all time - there are however lot's of MQTT reconnects.)

Edit: OK, not counting anymore. While I was typing this, the ESP rebooted.

Mincka commented 1 year ago

I disabled everything related to restart in the options but I still have reboots from time to time. This time, I have the logs from the AP and the router. I can see that the device is "disassociated". I think the ESP crashes first, not because of a connection drop. It does not reboot because of connection drop since "Restart on disconnect" is not checked and "Network Timeout until restart" is set to "-1". It may be related to the memory corruption issues I saw when logging on serial.

AP (from bottom to top)

Jan 24 20:21:06 nddmp[4268]: alarm : seqNo-[1789], level-[OTHER], info-[{"ip":"10.0.0.186","mac":"4C-75-25-xx-xx-xx","hname":"4C-75-25-xx-xx-xx","txR":"43.30 Mbps","rxR":"72.20 Mbps","ssid":"xxx,"dOs":"Debian-based Linux","dType":"Laptop/PC","mode":"11NG","status":"Authenticated","bssid":"C8-9E-43-xx-xx-xx","chnl":"1","chWidth":"20 MHz","rssi":"36","state":"QOS/ERP/HT","type":"wpa2","idle":"0","time":"00:00:00","txB":"990","rxB":"744","pmf":"none","vlanID":"1","Username":"Not Applicable","radio":"0"}]
Jan 24 20:21:05 nddmp[4268]: alarm : seqNo-[1788], level-[INFO], info-[esp32-xxxxxx (4C-75-25-xx-xx-xx) wireless client connected]
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx WPA: pairwise key handshake completed (RSN)
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx WPA: received EAPOL-Key frame (4/4 Pairwise)
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx WPA: sending 3/4 msg of 4-Way Handshake
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx WPA: received EAPOL-Key frame (2/4 Pairwise)
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx WPA: sending 1/4 msg of 4-Way Handshake
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx IEEE 802.11: associated (aid 1)
Jan 24 20:21:05 configd[4167]: EEM: Received Event: NEW_STA_ASSOC_EEM
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx IEEE 802.11: authenticated
Jan 24 20:21:05 nddmp[4268]: alarm : seqNo-[1787], level-[INFO], info-[esp32-xxxxxx (4C-75-25-xx-xx-xx) wireless client disconnected]
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx IEEE 802.11: disconnected
Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx IEEE 802.11: disassociated

Router (from top to bottom)

Tue Jan 24 20:21:05 2023 daemon.info dnsmasq-dhcp[1]: DHCPDISCOVER(eth2) 4c:75:25:xx:xx:xx
Tue Jan 24 20:21:05 2023 daemon.info dnsmasq-dhcp[1]: DHCPOFFER(eth2) 10.0.0.186 4c:75:25:xx:xx:xx
Tue Jan 24 20:21:05 2023 daemon.info dnsmasq-dhcp[1]: DHCPREQUEST(eth2) 10.0.0.186 4c:75:25:xx:xx:xx
Tue Jan 24 20:21:05 2023 daemon.info dnsmasq-dhcp[1]: DHCPACK(eth2) 10.0.0.186 4c:75:25:xx:xx:xx esp32-xxxxxx

After the reboot, it looks it took about 2 minutes before getting Presence updates again. (not in the logs above but I track the last_changed of the presence/devices topic.

Actually, the less that happens on the ESP, the more stable it seems.

I just disabled presence/devices update by setting "-1" on "Presence detection timeout". So there's only the maintenance/uptime that is published when the lock is idle. Since nothing is happening, it may also help to reduce random memory leaks / corruptions.

alexdelprete commented 1 year ago

Are you both using VLANs for these devices? Are there traffic restrictions between the IoT VLAN and Nuki Hub?

mundschenk-at commented 1 year ago

@Mincka, If you device rebooted, then you need to look earlier then the Jan 24 20:21:05 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx IEEE 802.11: disassociated line. I am pretty sure that by this point, the device was already starting up, because the next log line is the authentication for the new WiFi connection (at the same timestamp of 20:21:05. So whatever happened on the ESP side, it must have come earlier.

Mincka commented 1 year ago

There's nothing in the logs before. No VLANs. Mixed 2.4 Ghz/5 Ghz without other issues.

One of the ESP had an unrecoverable crash this night (did not manage to restart).

Jan 25 04:28:23 nddmp[4268]: alarm : seqNo-[1824], level-[INFO], info-[esp32-xxxxxx (4C-75-25-xx:xx:xx) wireless client disconnected]
Jan 25 04:28:22 kernel: [323338.424112] ieee80211_send_deauth: [wifi0vap0]: Sending deauth to 4c:75:25:xx:xx:xx with reason 34
Jan 25 04:28:22 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx IEEE 802.11: disconnected
Jan 25 04:28:22 hostapd: wifi0vap0: STA 4c:75:25:xx:xx:xx IEEE 802.11: disassociated

34 - Association denied due to excessive frame loss rates and/or poor conditions on currentoperating channel

https://community.cisco.com/t5/wireless-mobility-knowledge-base/802-11-association-status-802-11-deauth-reason-codes/ta-p/3148055

The uptime stopped updating 2 minutes before the disconnection: image

The presence/devices topic also stopped updated at that time, so before disconnection of the Wi-Fi: image

I thing this one is a legitime use case for restart on BT loss so I am going to set this setting again, with 600 seconds of delay.

The other ESP crashed this morning and managed to reboot. Exact same logs as in the previous post.

What are your usual uptimes @mundschenk-at ?

alexdelprete commented 1 year ago

34 - Association denied due to excessive frame loss rates and/or poor conditions on currentoperating channel

https://community.cisco.com/t5/wireless-mobility-knowledge-base/802-11-association-status-802-11-deauth-reason-codes/ta-p/3148055

Like I said in previous posts, it looks like a wifi compatibility issue...

Do you have other ESP32 devices on wifi, with different fw (tasmota, etc.)?

Mincka commented 1 year ago

A dozen of Shelly which are ESP-based, no issue with them. This error 34 was the only one in the logs. My latest idea is to connect two different ESPs on the same lock, at the same time.... Of course it's messed up in HA because of the discovery but I still have my manually configured MQTT sensors so I can track uptime and logs. That's enough to see if they fail together or not. It could be considered like a failover system. 😂

alexdelprete commented 1 year ago

I have 20 shellies, but they're all ESP82xx based. Only the new ones are ESP32 based. Which models do you have?

My latest idea is to connect two different ESPs on the same lock, at the same time....

You're complicating things...:)

Do you have only one kind of device? What device are you using? It would be good to check 2 different devices to see if the behaviour is exactly the same...

Mincka commented 1 year ago

One is the M5 Atom Lite and the other one is a ESP32 DevKitV1.

alexdelprete commented 1 year ago

I'm using the Atom U. What about the power? How are you powering it?

I'm trying to exclude the basics, but I keep thinking it's WiFi.

Mincka commented 1 year ago

Independent Samsung chargers with at least 1A output. On the Wi-Fi side, I try to disable "WMM Powersave", however, from what I see, it should be supported by ESP32 chips, but that's just another test...

alexdelprete commented 1 year ago

Julien, another test: if you have another M5 Atom, can you flash tasmota and see if it disconnects often? So we can exclude the fw.