home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
72.67k stars 30.43k forks source link

Frequent disconnects device connected via MQTT #62507

Closed bazmk2 closed 2 years ago

bazmk2 commented 2 years ago

The problem

The short story: Device connected via MQTT frequently drop out and reconnect. This means that in the UI they grey out frequently or do not respond appropriately.

The Long Story: This issue began this week with frequent drop outs of devices that I first noticed on Zigbee Devices. I had a fairly complex setup with 2 instances of HA setup. One was on a VM via a supervised install, the other on an orangePiPC on armbian install (supervised). The rationale for this was that the VM (on ESXi) was the powerhouse and the OPi was the zwave and zigbee gateway. The 2 instancew were connected via the zwavejs integration pointed at the OPi IP and via MQTT from zigbee2mqtt. The MQTT broker was hosted on the VM. Additional info - the VM is hosted on ESXi 6.7 running on a debian 10 OS. The decision to run a seperate instance on the OPi was due to the ESXi host machine being a HP DL380 rack server a bit noisy and lives in the garage but the zwave and zigbee performance was poor due to radio interferance and range as it wasnt in the house - the OPi is

This all worked perfectally. This week I have upgraded to the latest version of HA core and also upgraded the Mosquitto to the latest. I am not sure where the issue is with the MQTT or the HA core instance - I think it is mosquitto.

I get regular client disconnects in the MQTT log, my recorder shows frequent gaps i.e. a light is not off from 10:00 to 120:00 but off from 10:00to10:05, 10:07 to 10:23, 10:24 to ......... You get the drift

In my frustration I have

What version of Home Assistant Core has the issue?

core-2021.12.4

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Supervised

Integration causing the issue

MQTT

Link to integration documentation on our website

https://www.home-assistant.io/integrations/mqtt/

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Mosquitto log

1640103447: New connection from 192.168.110.43 on port 1883.
1640103447: New client connected from 192.168.110.43 as DVES_7678F2 (p2, c1, k30, u'home').
1640103447: New connection from 192.168.110.153 on port 1883.
1640103447: New client connected from 192.168.110.153 as DVES_0C56F3 (p2, c1, k30, u'home').
1640103452: Client DVES_4DEA0F has exceeded timeout, disconnecting.
1640103453: New connection from 192.168.110.150 on port 1883.
1640103453: New client connected from 192.168.110.150 as DVES_4DEA0F (p2, c1, k30, u'home').
1640103464: Client DVES_5DFDC3 has exceeded timeout, disconnecting.
1640103464: Client DVES_FFD67B has exceeded timeout, disconnecting.
1640103464: New connection from 192.168.110.48 on port 1883.
{"result": "ok", "data": {}}1640103464: New client connected from 192.168.110.48 as DVES_5DFDC3 (p2, c1, k30, u'home').
1640103464: New connection from 192.168.110.135 on port 1883.
1640103464: New client connected from 192.168.110.135 as DVES_FFD67B (p2, c1, k30, u'home').
1640103466: Client DVES_EB0C1F has exceeded timeout, disconnecting.
1640103466: Client DVES_98B603 has exceeded timeout, disconnecting.
1640103466: New connection from 192.168.110.152 on port 1883.
1640103466: New client connected from 192.168.110.152 as DVES_EB0C1F (p2, c1, k30, u'home').
1640103466: New connection from 192.168.110.28 on port 1883.
1640103466: New client connected from 192.168.110.28 as DVES_98B603 (p2, c1, k30, u'home').
1640103467: Client DVES_326A65 has exceeded timeout, disconnecting.
1640103467: New connection from 192.168.110.132 on port 1883.
1640103467: New client connected from 192.168.110.132 as DVES_326A65 (p2, c1, k30, u'home').
1640103468: Client DVES_979240 has exceeded timeout, disconnecting.
1640103468: Client home has exceeded timeout, disconnecting.
1640103468: Client DVES_F96C65 has exceeded timeout, disconnecting.
1640103468: Client DVES_B33ADA has exceeded timeout, disconnecting.
1640103468: Client DVES_98A342 has exceeded timeout, disconnecting.
1640103468: New connection from 192.168.110.46 on port 1883.
1640103468: New client connected from 192.168.110.46 as DVES_B33ADA (p2, c1, k30, u'home').
1640103468: New connection from 192.168.110.80 on port 1883.
1640103468: New client connected from 192.168.110.80 as DVES_F96C65 (p2, c1, k30, u'home').
1640103469: New connection from 192.168.110.60 on port 1883.
1640103469: New client connected from 192.168.110.60 as DVES_979240 (p2, c1, k30, u'home').
1640103469: New connection from 192.168.110.23 on port 1883.
1640103469: New client connected from 192.168.110.23 as home (p2, c1, k30, u'home').
1640103469: New connection from 192.168.110.110 on port 1883.
1640103469: New client connected from 192.168.110.110 as DVES_98A342 (p2, c1, k30, u'home').
1640103469: Client DVES_767E67 has exceeded timeout, disconnecting.
1640103469: New connection from 192.168.110.111 on port 1883.
1640103469: New client connected from 192.168.110.111 as DVES_767E67 (p2, c1, k30, u'home').
1640103479: Client DVES_452E41 has exceeded timeout, disconnecting.

HA logs

2021-12-21 16:07:13 WARNING (SyncWorker_1) [homeassistant.loader] We found a custom integration hacs which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2021-12-21 16:07:13 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration hildebrandglow_dcc which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2021-12-21 16:07:13 WARNING (SyncWorker_2) [homeassistant.loader] We found a custom integration mbapi2020 which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2021-12-21 16:07:13 WARNING (SyncWorker_1) [homeassistant.loader] We found a custom integration skyq which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2021-12-21 16:07:13 WARNING (SyncWorker_3) [homeassistant.loader] We found a custom integration icloud3 which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2021-12-21 16:08:45 WARNING (MainThread) [homeassistant.bootstrap] Waiting on integrations to complete setup: mbapi2020
2021-12-21 16:09:57 WARNING (MainThread) [homeassistant.bootstrap] Waiting on integrations to complete setup: met, xbox
2021-12-21 16:10:13 WARNING (MainThread) [homeassistant.setup] Setup of mobile_app is taking over 10 seconds.
2021-12-21 16:10:13 WARNING (MainThread) [homeassistant.config_entries] Config entry '192.168.1.18:3493' for nut integration not ready yet; Retrying in background
2021-12-21 16:10:13 WARNING (MainThread) [homeassistant.config_entries] Config entry 'Kyocera ECOSYS M5521cdn (3)' for ipp integration not ready yet: Invalid response from API: Timeout occurred while connecting to IPP server.; Retrying in background
2021-12-21 16:10:18 WARNING (MainThread) [slixmpp.stringprep] Using slower stringprep, consider compiling the faster cython/libidn one.
2021-12-21 16:10:21 WARNING (MainThread) [slixmpp.basexmpp] Legacy XMPP 0.9 protocol detected.
2021-12-21 16:10:23 ERROR (SyncWorker_7) [homeassistant] Error doing job: Task was destroyed but it is pending!
2021-12-21 16:10:45 WARNING (MainThread) [homeassistant.components.dlna_dmr] Failed connecting to recently alive device at http://192.168.0.202:8091/XD/BO5EBO5E-F00D-F00D-FEED-A0F6FD7E9B00.xml: UpnpError('Could not find service state table element', None)
2021-12-21 16:10:56 WARNING (Thread-18) [matplotlib.font_manager] Matplotlib is building the font cache; this may take a moment.
2021-12-21 16:11:41 WARNING (MainThread) [homeassistant.components.media_player] Updating skyq media_player took longer than the scheduled update interval 0:00:10
2021-12-21 16:11:41 WARNING (MainThread) [homeassistant.helpers.entity] Update of media_player.sky_q is taking over 10 seconds
2021-12-21 16:11:41 WARNING (MainThread) [homeassistant.components.media_player] Updating dlna_dmr media_player took longer than the scheduled update interval 0:00:10

Additional information

No response

probot-home-assistant[bot] commented 2 years ago

mqtt documentation mqtt source (message by IssueLinks)

probot-home-assistant[bot] commented 2 years ago

Hey there @emontnemery, mind taking a look at this issue as it has been labeled with an integration (mqtt) you are listed as a code owner for? Thanks! (message by CodeOwnersMention)

emontnemery commented 2 years ago

This week I have upgraded to the latest version of HA core and also upgraded the Mosquitto to the latest. I am not sure where the issue is with the MQTT or the HA core instance - I think it is mosquitto.

Please explain more about your Mosquitto setup, are you using the official Mosquitto addon or something else?

Also, the mosquitto log shows your Tasmota devices reconnecting to the MQTT broker, but I don't see anything about Home Assistant reconnecting, does that appear to be stable?

If you execute command status11 on the Tasmota devices, what does it say?

emontnemery commented 2 years ago

@maframan, this could maybe be related to your problem

bazmk2 commented 2 years ago

Using the HA official Mosquitto addon 6.0.1

I can't see any HA disconnects to the Mosquitto. I think I did on the initial install but will have to spin the VM up to look. I presume that they would appear on the docker network IP.

Tasmota Device (UK Sonoff T1 3 gang) with 10.0.0

20:33:28.053 CMD: status11
20:33:28.059 MQT: stat/tasmota_98B603/STATUS = {"Status":{"Module":30,"DeviceName":"Tasmota","FriendlyName":["Tasmota","Tasmota2","Tasmota3"],"Topic":"tasmota_98B603","ButtonTopic":"0","Power":7,"PowerOnState":3,"LedState":1,"LedMask":"FFFF","SaveData":1,"SaveState":1,"SwitchTopic":"0","SwitchMode":[0,0,0,0,0,0,0,0],"ButtonRetain":0,"SwitchRetain":0,"SensorRetain":0,"PowerRetain":0,"InfoRetain":0,"StateRetain":0}}
20:34:05.704 MQT: Attempting connection...
20:34:06.457 MQT: Connected
20:34:06.462 MQT: tele/tasmota_98B603/LWT = Online (retained)
20:34:06.465 MQT: cmnd/tasmota_98B603/POWER = 
20:34:12.846 MQT: stat/tasmota_98B603/STATUS1 = {"StatusPRM":{"Baudrate":115200,"SerialConfig":"8N1","GroupTopic":"tasmotas","OtaUrl":"http://ota.tasmota.com/tasmota/release/tasmota.bin.gz","RestartReason":"Exception","Uptime":"1T02:30:35","StartupUTC":"2021-12-21T17:03:37","Sleep":50,"CfgHolder":4617,"BootCount":1013,"BCResetTime":"2020-08-28T12:11:45","SaveCount":6178,"SaveAddress":"F5000"}}
20:34:12.898 MQT: stat/tasmota_98B603/STATUS11 = {"StatusSTS":{"Time":"2021-12-22T20:34:12","Uptime":"1T02:30:35","UptimeSec":95435,"Heap":24,"SleepMode":"Dynamic","Sleep":50,"LoadAvg":20,"MqttCount":1906,"POWER1":"ON","POWER2":"ON","POWER3":"ON","Wifi":{"AP":1,"SSId":"daveandvickyiot","BSSId":"86:2A:A8:C4:79:CB","Channel":11,"Mode":"11n","RSSI":100,"Signal":-39,"LinkCount":1,"Downtime":"0T00:00:09"}}}
20:34:12.948 MQT: tele/tasmota_98B603/STATE = {"Time":"2021-12-22T20:34:12","Uptime":"1T02:30:35","UptimeSec":95435,"Heap":24,"SleepMode":"Dynamic","Sleep":50,"LoadAvg":20,"MqttCount":1906,"POWER1":"ON","POWER2":"ON","POWER3":"ON","Wifi":{"AP":1,"SSId":"daveandvickyiot","BSSId":"86:2A:A8:C4:79:CB","Channel":11,"Mode":"11n","RSSI":100,"Signal":-39,"LinkCount":1,"Downtime":"0T00:00:09"}}
20:34:12.951 MQT: stat/tasmota_98B603/RESULT = {"Time":"2021-12-22T20:34:12","Uptime":"1T02:30:35","UptimeSec":95435,"Heap":24,"SleepMode":"Dynamic","Sleep":50,"LoadAvg":20,"MqttCount":1906,"POWER1":"ON","POWER2":"ON","POWER3":"ON","Wifi":{"AP":1,"SSId":"daveandvickyiot","BSSId":"86:2A:A8:C4:79:CB","Channel":11,"Mode":"11n","RSSI":100,"Signal":-39,"LinkCount":1,"Downtime":"0T00:00:09"}}
20:34:38.232 MQT: tele/tasmota_98B603/STATE = {"Time":"2021-12-22T20:34:38","Uptime":"1T02:31:01","UptimeSec":95461,"Heap":22,"SleepMode":"Dynamic","Sleep":50,"LoadAvg":19,"MqttCount":1906,"POWER1":"ON","POWER2":"ON","POWER3":"ON","Wifi":{"AP":1,"SSId":"daveandvickyiot","BSSId":"86:2A:A8:C4:79:CB","Channel":11,"Mode":"11n","RSSI":100,"Signal":-39,"LinkCount":1,"Downtime":"0T00:00:09"}}
20:34:50.982 MQT: Attempting connection...
20:34:51.050 MQT: Connected
20:34:51.054 MQT: tele/tasmota_98B603/LWT = Online (retained)
20:34:51.058 MQT: cmnd/tasmota_98B603/POWER = 
20:34:52.164 MQT: stat/tasmota_98B603/STATUS1 = {"StatusPRM":{"Baudrate":115200,"SerialConfig":"8N1","GroupTopic":"tasmotas","OtaUrl":"http://ota.tasmota.com/tasmota/release/tasmota.bin.gz","RestartReason":"Exception","Uptime":"1T02:31:15","StartupUTC":"2021-12-21T17:03:37","Sleep":50,"CfgHolder":4617,"BootCount":1013,"BCResetTime":"2020-08-28T12:11:45","SaveCount":6178,"SaveAddress":"F5000"}}
20:34:52.214 MQT: stat/tasmota_98B603/STATUS11 = {"StatusSTS":{"Time":"2021-12-22T20:34:52","Uptime":"1T02:31:15","UptimeSec":95475,"Heap":22,"SleepMode":"Dynamic","Sleep":50,"LoadAvg":44,"MqttCount":1907,"POWER1":"ON","POWER2":"ON","POWER3":"ON","Wifi":{"AP":1,"SSId":"daveandvickyiot","BSSId":"86:2A:A8:C4:79:CB","Channel":11,"Mode":"11n","RSSI":100,"Signal":-39,"LinkCount":1,"Downtime":"0T00:00:09"}}}
20:34:52.263 MQT: tele/tasmota_98B603/STATE = {"Time":"2021-12-22T20:34:52","Uptime":"1T02:31:15","UptimeSec":95475,"Heap":21,"SleepMode":"Dynamic","Sleep":50,"LoadAvg":43,"MqttCount":1907,"POWER1":"ON","POWER2":"ON","POWER3":"ON","Wifi":{"AP":1,"SSId":"daveandvickyiot","BSSId":"86:2A:A8:C4:79:CB","Channel":11,"Mode":"11n","RSSI":100,"Signal":-39,"LinkCount":1,"Downtime":"0T00:00:09"}}
20:34:52.265 MQT: stat/tasmota_98B603/RESULT = {"Time":"2021-12-22T20:34:52","Uptime":"1T02:31:15","UptimeSec":95475,"Heap":21,"SleepMode":"Dynamic","Sleep":50,"LoadAvg":43,"MqttCount":1907,"POWER1":"ON","POWER2":"ON","POWER3":"ON","Wifi":{"AP":1,"SSId":"daveandvickyiot","BSSId":"86:2A:A8:C4:79:CB","Channel":11,"Mode":"11n","RSSI":100,"Signal":-39,"LinkCount":1,"Downtime":"0T00:00:09"}}
bazmk2 commented 2 years ago

Also I am not sure if this is related but I have this in the HA log today too

Logger: homeassistant.components.websocket_api.http.connection Source: components/websocket_api/http.py:126 Integration: Home Assistant WebSocket API (documentation, issues) First occurred: 00:19:11 (22 occurrences) Last logged: 19:23:42

[2703208968] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds [2793777520] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds [2783955096] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds [2792435848] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds [2747011784] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds

maframan commented 2 years ago

i suspect that the problem is caused by mosquitto watch dog feature because before activation I did not have this drop connection situation . Now i've disabled watchdog and tomorrow I'll let you know

emontnemery commented 2 years ago

@bazmk2 OK, so something is seriously wrong with either your Tasmota device or the Mosquitto broker, "MqttCount":1907 means Tasmota has reconnected 1907 times to the broker during just 1 day.

maframan commented 2 years ago

i suspect that the problem is caused by mosquitto watch dog feature because before activation I did not have this drop connection situation . Now i've disabled watchdog and tomorrow I'll let you know

Today, the situation after Watchdog disabled, is very stable. I think that the disconnection/drop problem has been resolved.

bazmk2 commented 2 years ago

So I have deleted my previous comments and logs as I beleive they are irrelevant.

I believe I have solved the issue and it is to do with the tasmota software on the devices. I have found that the later versions of the software (not sure exactly where) change the way that they act in terms of wifi and reboot/reconnect. It turns out that I had set a NTP server in my DHCP some time previous that I had forgotten about. However, on "update day" - i chose to update OTA the HA, mosquitto and tasmota software around the same time I beleive that the Tasmota update introduced the issue with the NTP server and caused the reconnects.

Yesterday I looked at the device list in tasmota-admin and all of the devices except one still on 9.x had uptimes of a minutes. The one on 9.x was at 17 days.

Now of course I was still fiddling - I was trying to fix this so I have introduced some other confounders - the MQTT brocker I use is not a seperate debian machine with HiveMQ setup but it wasnt until I removed the NTP setting that I got uptimes of over an hour for the tasmota devices.

Hopefully this may be of assistance to another tinkerer - my one message is change 1 thing and WAIT!