home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
72.67k stars 30.42k forks source link

MQTT Integration does not reconnect if connection lost to server #50892

Closed EmilyNerdGirl closed 3 years ago

EmilyNerdGirl commented 3 years ago

The problem

Been having a long standing issue where if the MQTT integration loses connectivity to the mqtt serrver, it will fail to reconnect until restarting HA, or sometimes will not until auto-reconnect 2-3 hours later . Second permutation of this is if it is connected, and I go to reconfigure the mqtt server after a couple days to check something, it will hang and fail to connect to the existing server it's connected to on the first dialog with connection info.

What is version of Home Assistant Core has the issue?

core-2021.5.5

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

mqtt

Link to integration documentation on our website

https://www.home-assistant.io/integrations/mqtt/

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: homeassistant.components.mqtt
Source: components/mqtt/__init__.py:941
Integration: MQTT (documentation, issues)
First occurred: 5:18:58 AM (30 occurrences)
Last logged: 7:48:59 AM

No ACK from MQTT server in 10 seconds (mid: 773)
No ACK from MQTT server in 10 seconds (mid: 774)
No ACK from MQTT server in 10 seconds (mid: 775)
No ACK from MQTT server in 10 seconds (mid: 776)
No ACK from MQTT server in 10 seconds (mid: 777)

Logger: homeassistant.components.mqtt
Source: components/mqtt/__init__.py:925
Integration: MQTT (documentation, issues)
First occurred: May 19, 2021, 6:41:03 PM (4 occurrences)
Last logged: 5:17:48 AM

Disconnected from MQTT server mqttserver:1883 (1)

Logger: paho.mqtt.client
Source: /usr/local/lib/python3.8/site-packages/paho/mqtt/client.py:2404
First occurred: May 19, 2021, 10:56:09 PM (2 occurrences)
Last logged: May 19, 2021, 10:57:30 PM

failed to receive on socket: [Errno 104] Connection reset by peer

Additional information

Some history here, I've had an Eclipse Mosquitto server running for some several years even before using HA (upgraded mosquitto every few weeks so current), I have multiple clients such as Home Assistant, Telegraf, MQTT Explorer, dozens of mqtt IoT devices etc running. It was running off an external mqtt server, not within a container/supervisor with HA. I've noticed when HA loses connection, other clients such as Telegraf, MQTT Explorer and the IoT devices/sensors will stay connected without any problems. Sometimes the mqtt integration will reconnect after an odd amount of time like 2 - 3 hours.

I had noticed it most prominently when I would stop the mqtt server to add a new user account, HA mqtt would lose connection and never reconnect while other clients weren't having any problems with this. I recently tried to "help" by setting up an emqx cluster inside k8s with a load balancer running so:

  1. Be able to manage mqtt accounts thru ldap so don't need to bounce mosquitto every time
  2. Be able to have nodes restart and since behind a load balancer, always an mqtt server node endpoint available (there are 3 nodes in the cluster)

It's had no change, and actually been worse as if the LB sends HA to a different node for whatver reason, it's now failing 2-3 times a day to connect, again while no other client is reporting problems.

Also as noted above, even if everything is working properly and the MQTT integration is working, quite often before it fails if i go to the Options screen, try to connect to the existing mosquitto server, it will fail to connect. Doesn't matter if I use name, IP, or even another good server, it's like it's out of out going sockets, it's very odd. To further add to how strange this is, if I restart HA, I can do this connect test initially without any problems, and only appears to happen after a period of time running.

My HA instance is running on a Dell sff PC with 24gb or RAM, gigabit ethernet connection, tried completely different cables and switches on the network. Connecting over port 1883 unencrypted to keep things simple for testing.

Looking for ideas how to debug where it's running into problems. One thought I had was running a mosquitto instance in supervisor that bridges to whatever upstream mqtt server I am using, but seems a bit superfluous. thank you!

probot-home-assistant[bot] commented 3 years ago

mqtt documentation mqtt source (message by IssueLinks)

probot-home-assistant[bot] commented 3 years ago

Hey there @emontnemery, mind taking a look at this issue as its been labeled with an integration (mqtt) you are listed as a codeowner for? Thanks! (message by CodeOwnersMention)

emontnemery commented 3 years ago

This is quite odd indeed, the mosquitto MQTT server should not give you these kind of troubles. Which host OS and VMM do you use to run Home Assistant OS?

You could try enabling some debug log to see if it gives any clarity:

 logger:
   default: info
   logs:
     homeassistant.components.mqtt: debug
     paho.mqtt.client: debug

Also, try enabling debug log in the mosquitto server.

EmilyNerdGirl commented 3 years ago

Will do, thank you for the response! Over the weekend I set up the Mosquitto broker add-on into Supervisor, and while I was bridging it to the upstream MQTT server, and the MQTT Integration was connected to the HA side broker in Supervisor, I was able to reproduce the problem as I had cycled the add-on several times in a short period of time as I was getting the setting correct in the conf file.

EmilyNerdGirl commented 3 years ago

Ok I've turned on debugging, going to take a day or so then will cycle the connection on the mqtt server to see if I can catch it in the logs.

EmilyNerdGirl commented 3 years ago

Ok I hit it last night, no exceptions, but noticed something interesting. The logs filled with 1000s of messages for pychromecast.socket_client couldn't hit one specific Google device out of the 7 of them for some reason. Around this time the load balencer sent the Home Assistant mqtt integration to a different internal node (total coincidence on the timing as there no relationship here at all to networking, hosts, etc.) I also have a Mosquitto broker running in Supervisor, and I saw the same disconnect happen there. However the Mosquitto broker immediately reconnected, but the integration never does. There's no errors or exceptions that I can find.

2021-05-26 04:42:45 WARNING (Thread-15) [pychromecast.socket_client] [Office display(192.168.0.126):8009] Error communicating with socket, resetting connection
2021-05-26 04:42:45 ERROR (Thread-15) [pychromecast.socket_client] [Office display(192.168.0.126):8009] Error in select call: filedescriptor out of range in select()
2021-05-26 04:42:45 WARNING (Thread-15) [pychromecast.socket_client] [Office display(192.168.0.126):8009] Error communicating with socket, resetting connection
2021-05-26 04:42:45 ERROR (Thread-15) [pychromecast.socket_client] [Office display(192.168.0.126):8009] Error in select call: filedescriptor out of range in select()
2021-05-26 04:42:45 WARNING (Thread-15) [pychromecast.socket_client] [Office display(192.168.0.126):8009] Error communicating with socket, resetting connection
2021-05-26 04:43:59 WARNING (Thread-27) [homeassistant.components.mqtt] Disconnected from MQTT server [REDACTED]:1883 (1)
2021-05-26 04:44:00 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:44:02 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:44:06 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:44:14 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:44:30 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:45:02 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:45:25 INFO (MainThread) [homeassistant.components.automation.mqtt_urad] MQTT - URAD: Running automation actions
2021-05-26 04:45:25 INFO (MainThread) [homeassistant.components.automation.mqtt_urad] MQTT - URAD: Executing step call service
2021-05-26 04:45:25 DEBUG (SyncWorker_36) [paho.mqtt.client] Sending PUBLISH (d0, q0, r0, m523), 'b'/home-assistant/urad/state'', ... (2 bytes)
2021-05-26 04:45:25 DEBUG (MainThread) [homeassistant.components.mqtt] Transmitting message on /home-assistant/urad/state: '27', mid: 523
2021-05-26 04:45:35 WARNING (MainThread) [homeassistant.components.mqtt] No ACK from MQTT server in 10 seconds (mid: 523)
2021-05-26 04:46:06 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:48:06 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:50:06 DEBUG (Thread-27) [paho.mqtt.client] Sending CONNECT (u1, p1, wr1, wq0, wf1, c1, k60) client_id=b'2gykYNSgo0uSVdHfmcqHXk'
2021-05-26 04:50:25 INFO (MainThread) [homeassistant.components.automation.mqtt_urad] MQTT - URAD: Running automation actions

Right now everything else seems to be working in HA except the MQTT integration in HA is just keeps repeating Sending CONNECT and No ACK from MQTT server in 10 seconds (mid: 564) and any attempts to reconfigure then connect from the MQTT Integration do not work, and no messages in the logs either when I try to reconfigure the integration. Restarted HA to get mqtt traffic moving again.

EmilyNerdGirl commented 3 years ago

Should also note during this time where the integration becomes unresponsive for connecting to any mqtt server, it didn't matter what MQTT server I tried to connect to. I couldn't connect to the remote mqtt cluster in k8s, nor the local mqtt server running in Supervisor on the same box for HA. Also again other mqtt clients at the same time are connected or reconnected without a problem to either MQTT server instance at the same time the integration is having issues.

Here's a screen shot of the options screen after it comes back:

image

emontnemery commented 3 years ago

Interesting. I think Error in select call: filedescriptor out of range in select() indicates something is leaking file descriptors, just as you suspected earlier.

Do you have any custom integrations enabled, if you do could you disable them just to rule out it's one of them? Next time the issue happens, can you try to do some Linux level debugging to check which fd's the home assistant process has opened?

EmilyNerdGirl commented 3 years ago

I disabled as much as I could to keep me in a bare minimum functioning state: ESPHome, Frigate, Google Cast, HACS, Supervisor, Insteon, LIFX, mqtt, ONVIF, Roku (can't see how to disable this one), Zwave JS and ZHA.

emontnemery commented 3 years ago

OK, I see. Then i have no better idea than to reproduce again and use lsof or some similar tool to try to see what's going on.

EmilyNerdGirl commented 3 years ago

Let me also see if I can reproduce this on a 2nd system in a VM, first without a backup, then with.

EmilyNerdGirl commented 3 years ago

Ok I stood up a vm with a clean install with the mqtt client connected to the same mqtt server cluster with only the MQTT integration enabled.

EmilyNerdGirl commented 3 years ago

Just an update, starting to suspect it is indeed another integration causing it it to run out of either sockets of file descriptors/open files. I've had 10 integrations disabled, and even with some unrelated k8s issues the last day or so, this integration has reconnected without any problems. The 2nd instance I have of the new install as well. Going to see if I can narrow things a bit, but likely not related to this integration and this is being impacted by something else.

EmilyNerdGirl commented 3 years ago

I am going to close this for now. Looks like indeed another integration (narrowed it down to a couple left to enable) may have a leak as noticing not only am I no longer having issues with this integration inability to reconnect, but also I had a strange (apparently not normal) slow memory leak of about 100mb a day that is now gone as well.