homieiot / homie-esp8266

πŸ’‘ ESP8266 framework for Homie, a lightweight MQTT convention for the IoT
http://homieiot.github.io/homie-esp8266
MIT License
1.36k stars 308 forks source link

MQTT not reconnecting after days of uptime vs. minutes / hours #320

Closed johnjore closed 5 years ago

johnjore commented 7 years ago

Hi!

My mosquitto server goes offline from time to time (separate issue to resolve), but when this happens, homie does not always reconnect after the server is rebooted and mosquitto is again running. I can only reproduce the issue after letting it run for several days, not minutes or hours.

Here I left it running for a few days, and this morning I have this message (after the server crash):

MQTT packet acknowledged, packetId: 7459 ... Looping through the various sensor, but there has was no change in their state, until: ... Heap size: 30968 Temperature: 24.20 Β°C Humidity: 43.20 % Lux: 21 lx Motion: CLOSED MQ:23.74 LPG:0.00 ppm CO:0.00 ppm SMOKE:0.00 ppm βœ– MQTT disconnected Triggering MQTT_DISCONNECTED event... MQTT disconnected, reason: 0 ↕ Attempting to connect to MQTT...

After 8h in this state, its not moved on and it never tries to reconnect, but other sensors have. I've not managed to figure out what, or why, it does not reconnect. As I can see other sensors are reporting in, the problem is not with the mosquitto server. Log file on the mosquitto server shows no traffic from this node when it is in this state.

After resetting the device, and then manually stopping the mosquitto service:

MQ:23.74 LPG:0.00 ppm CO:0.00 ppm SMOKE:0.00 ppm βœ– MQTT disconnected Triggering MQTT_DISCONNECTED event... MQTT disconnected, reason: 0 ↕ Attempting to connect to MQTT... Retrying (500ms)... ↕ Attempting to connect to MQTT... Retrying (1000ms)... ↕ Attempting to connect to MQTT... Retrying (2000ms)... ↕ Attempting to connect to MQTT... Retrying (4000ms)... ↕ Attempting to connect to MQTT...

It all looks normal and when the mosquitto service is started, homie reconnects and all is good.

Versions: Using your mqtt client and homie (2) as of March 10th.

Any idea as to why it fails to reconnect after days of uptime vs minutes or hours? Is there additional debug information I can provide?

JJ

marvinroger commented 7 years ago

This one is going to be hard to debug. I'll try to run an ESP8266 for some days, but I am afraid the issue comes from the ESPAsyncTCP library (or maybe there's a memory leak somewhere).

johnjore commented 7 years ago

Hi! I think what's really happening is that the mosquitto server has not fully crashed, but partly responds, causing the client to hang during reconnect. Maybe waiting for the initialization to fully complete?

I.e, not related to uptime, which is why when fully stopping mosquitto, the problem does not materialize?

marvinroger commented 7 years ago

When this happens (when Homie is stuck), can you connect to the broker with another client? On your phone or computer?

johnjore commented 7 years ago

Not tried that specifically. SSH responds, but logons do not complete after entering username / password. Just hangs.

After the powercycle, the Homie device on my desk remains in its stuck state, but other devices report in.

I can test next time it happens, but pretty sure mosquitto is responding to the initial connect, but is not functional. Same as SSH.

marvinroger commented 7 years ago

I forgot that I have a device that is running for some months now, without any issues, so there's no obvious errors in the code. Unfortunately, ESP8266 are small devices that might not be as bulletproof as other devices, so if your broker is not running rock stable, don't expect the ESP8266s to work great.

johnjore commented 7 years ago

Does the ESP8266 not have a watchdog timer? If it does, this may be the situation to enable it? (Restart on a stuck process?)

johnjore commented 7 years ago

Further testing shows that after restarting both wifi access points (same SSID, but different channels), some sensors don't fully reconnect to the MQTT server. Strangely they are marked as "$online true", but "$stats" are not updated. (Signal and Uptime). Restarted the mosquitto service and $stats began to update themselves again.

Note that this is on a new box where mosquitto is running without issues, and a sensor on my desk was able to read/write to the service.

marvinroger commented 7 years ago

Can you test with the AsyncMqttClient v0.7.0?

johnjore commented 7 years ago

No, its not fixed in 0.7.1

Firmware updated a number of devices to 0.7.1 and applied new config on the wifi box causing the devices to disconnect and reconnect to the wifi.

A couple of devices didn't re-connect to the MQTT server correctly. They respond to ping, but no data is logged on the MQTT server, including the $stats values. Restarted mosquitto service, and data, $stats and sensor data, appeared for the MQTT server again.

JJ

marvinroger commented 7 years ago

@johnjore please wait for the next beta, things should stabilize. πŸ˜‰

johnjore commented 7 years ago

:) Thanks!

usatenko commented 6 years ago

I found that using QoS=2 may cause the problem, I had some such settings in nodered mqtt nodes. Do you have some like this? I see a message like this on mosquitto "1516821985: Client iot home-5ccf7fa4297b has exceeded timeout, disconnecting." and after that homie falls into reconnection loop.

johnjore commented 6 years ago

What QoS value have you had the most success with if 2 causes problems with client or server? 0 or 1?

usatenko commented 6 years ago

Unfortunately, after a couple of tests I see that with qos 0 and 1 situation almost the same. It just stops reconnecting more rarely. In general I do not know what to do anymore. Among all devices I have the one that does not write to mqtt anything and it works surprisibgly stable, and those ones which write stop connecting more than once a day. I saw the mosquitto log and every time it happens it writes "connection time out ... disconnecting". @marvinroger, what do you think it can be?

johnjore commented 6 years ago

I moved one of my sensor packs to a Raspberry Pi's USB port and logged the output to a text file to see what is going on. I removed the outputs from the sensors and some of the repetitive outputs, which is why the uptime jumps forward. During the night this is what happened:

γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 17732s
βœ– Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
Wi-Fi disconnected, reason: 8
↕ Attempting to connect to Wi-Fi...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 17734s
βœ– setNodeProperty(): impossible now
βœ– Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
Wi-Fi disconnected, reason: -53
↕ Attempting to connect to Wi-Fi...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 17736s
βœ” Wi-Fi connected, IP: x.x.x.x
Triggering WIFI_CONNECTED event...
WIFI_CONNECTED
MAC Address: xx:xx:xx:xx:xx:xx
Wi-Fi connected, IP: x.x.x.x
↕ Attempting to connect to MQTT...
βœ– setNodeProperty(): impossible now
βœ– setNodeProperty(): impossible now
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 26%
  β€’ Uptime: 17796s
βœ– setNodeProperty(): impossible now
βœ– setNodeProperty(): impossible now
βœ– Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
Wi-Fi disconnected, reason: 2
↕ Attempting to connect to Wi-Fi...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 39929s
βœ– Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
Wi-Fi disconnected, reason: 2
↕ Attempting to connect to Wi-Fi...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 39930s
βœ– Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
Wi-Fi disconnected, reason: -55
↕ Attempting to connect to Wi-Fi...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 39930s
βœ– Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
Wi-Fi disconnected, reason: -53
↕ Attempting to connect to Wi-Fi...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 39932s
βœ– setNodeProperty(): impossible now
βœ” Wi-Fi connected, IP: x.x.x.x
Triggering WIFI_CONNECTED event...
WIFI_CONNECTED
MAC Address: x:x:x:x:x:x
Wi-Fi connected, IP: x.x.x.x
↕ Attempting to connect to MQTT...
βœ– setNodeProperty(): impossible now
βœ– setNodeProperty(): impossible now
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 39992s
βœ– setNodeProperty(): impossible now

While it reconnects to the WiFi, it never reconnects to the MQTT server.

Restarted the MQTT Server:

γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 57036s
βœ– setNodeProperty(): impossible now
βœ– setNodeProperty(): impossible now
βœ– MQTT disconnected
Triggering MQTT_DISCONNECTED event...
MQTT disconnected, reason: 0
↕ Attempting to connect to MQTT...
Sending initial information...
βœ” MQTT ready
Triggering MQTT_READY event...

Homie is now reporting data to the MQTT server as normal again. Any clues as to what is going on?

timpur commented 6 years ago

Could it be due to the server IP changing from the crash. Need to check if this could effect homie and reconnecting to mqtt when I have a computer and look at the code

johnjore commented 6 years ago

Hm... doubt it as no ip addresses changed. Static ip on client and mqtt server. Only the WiFi Access Point changed, twice.

timpur commented 6 years ago

Will try replicate when i have some time, but ive tested this before (while back, could of broke... ) but restarting the broker or ap, shouldn't get it stuck in a connection loop.

Any success with the latest everything (mainly the esp8266 arduino core 2.4.0)?

Looks like the mqtt client doesn't detect that its disconnected .....

Could use Homie.getMqttClient().connected() to get the status of the underlying client lib connection status. Maybe print this out at every homie event to see whats happening with the mqtt client ??

johnjore commented 6 years ago

I upgraded all the libraries yesterday (including esp core 2.4) but during the night there was no change of AP.

I've added the connected status to the serial output, and will report back when device switches AP and let you know how it goes.

usatenko commented 6 years ago

As a workaround I think we can reboot ESP on each "onDisconnect" event of AsyncMqttClient. It is not ideal but sounds as a solution in my case. How do you think?

johnjore commented 6 years ago

Its a sledgehammer approach, but better than the devices going offline, needing a hard-reset, or resetting the MQTT server.

timpur commented 6 years ago

Ideally we should fix this issue and not work around it. So send me updates and progress @johnjore

johnjore commented 6 years ago

So, it finally happened again and multiple devices went offline at the same time.

On the device with logging and updated to the latest packages at the time, "Homie.getMqttClient().connected(): 1" is reported throughout.

After x hours, it never recovered and I power cycled it. Connected fine and all ok again. I believe a blip on the wifi caused the devices to disconnect, and never reconnect.

usatenko commented 6 years ago

As a workaround, in the main loop just use next code:

long lastWifiConnected = 0;
long lastWifiDisconnected = millis();
long lastMqttConnected = 0;
long lastMqttDisconnected = millis();
bool restartInitiated = false;
void onHomieEvent(const HomieEvent& event) {
  switch(event.type) {
    case HomieEventType::WIFI_CONNECTED:
      lastWifiConnected = millis();
      break;
    case HomieEventType::WIFI_DISCONNECTED:
      lastWifiDisconnected = millis();
      break;
    case HomieEventType::MQTT_READY:
      lastMqttConnected = millis();
      break;
    case HomieEventType::MQTT_DISCONNECTED:
      lastMqttDisconnected = millis();
      break;
  }
}
..
  if ((lastMqttDisconnected > lastMqttConnected && millis() - lastMqttDisconnected > 120 * 1000 || lastWifiDisconnected > lastWifiConnected && millis() - lastWifiDisconnected > 120 * 1000) && !restartInitiated) {
    restartInitiated = true;
    Serial.println("Restarting");
    ESP.restart();
  }
timpur commented 6 years ago

This wont help if the client doesn't recognise this it is actually disconnected. Sounds like a bug in the MQTT client. @johnjore is is possible to get the console log ? Did is show any event triggers ?

johnjore commented 6 years ago

Hope this helps. I washed the log file down from 435MB (too large for Notepad++) by removing large chunks of duplicate information like Wifi signal quality. Homie.getMqttClient().connected(): is always 1. Thousands of lines with "βœ– setNodeProperty(): impossible now" were also removed. No events were deleted.

Setup completed
βœ” MQTT ready
Triggering MQTT_READY event...
Calling setup function...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 27s
Homie.getMqttClient().connected(): 1
Homie.getMqttClient().connected(): 1
βœ– MQTT disconnected
Triggering MQTT_DISCONNECTED event...
MQTT disconnected, reason: 0
↕ Attempting to connect to MQTT...
↕ Attempting to connect to MQTT...
Sending initial information...
βœ” MQTT ready
Triggering MQTT_READY event...
Homie.getMqttClient().connected(): 1
Homie.getMqttClient().connected(): 1
βœ– MQTT disconnected
Triggering MQTT_DISCONNECTED event...
MQTT disconnected, reason: 0
↕ Attempting to connect to MQTT...
Sending initial information...
βœ” MQTT ready
Triggering MQTT_READY event...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 89945s
Homie.getMqttClient().connected(): 1
Homie.getMqttClient().connected(): 1
βœ– MQTT disconnected
Triggering MQTT_DISCONNECTED event...
MQTT disconnected, reason: 0
↕ Attempting to connect to MQTT...
↕ Attempting to connect to MQTT...
Sending initial information...
βœ” MQTT ready
Triggering MQTT_READY event...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 167521s
Homie.getMqttClient().connected(): 1
Homie.getMqttClient().connected(): 1
βœ– MQTT disconnected
Triggering MQTT_DISCONNECTED event...
MQTT disconnected, reason: 0
↕ Attempting to connect to MQTT...
↕ Attempting to connect to MQTT...
Sending initial information...
βœ” MQTT ready
Triggering MQTT_READY event...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 239060s
Homie.getMqttClient().connected(): 1
βœ– Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
Wi-Fi disconnected, reason: 8
↕ Attempting to connect to Wi-Fi...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 305432s
Homie.getMqttClient().connected(): 1
βœ” Wi-Fi connected, IP: x.x.x.x
Triggering WIFI_CONNECTED event...
WIFI_CONNECTED
MAC Address: 5C:CF:7F:24:83:98
Wi-Fi connected, IP: x.x.x.x
↕ Attempting to connect to MQTT...
γ€½ Sending statistics...
  β€’ Wi-Fi signal quality: 100%
  β€’ Uptime: 305433s
Homie.getMqttClient().connected(): 1
timpur commented 6 years ago

Intresting, will need to look into this more, but it's hard since it's not easly replicated

johnjore commented 6 years ago

Is it possible to force the MQTT client to disconnect when wifi disconnects?

Worst case, a reboot command could be added when a wifi disconnect event happens, not really a solution, but better than needing to reboot the device manually, as it never recovers on its own as it is now.

As an FYI, I currently use homie on all devices with a fixed 5V/3.3V supply like switches, dimmers, sockets and sensor packs. These don't have push buttons to reset the ESP, (integrated into the walls) so I need to toggle the fuse. This is anything but ideal, and preventing me deploying additional devices.

timpur commented 6 years ago

Yes, maybe forcing mqtt lib to disconnect when wifi goes down. I'll have a think about other solutions

Also just to be sure, what version of homie and version of Arduino core ? And any other details related please.

johnjore commented 6 years ago

Hi! didn't realize the post had been edited:

homie-esp8266-develop at version 2.0.0 (Downloaded Jan 28th) async-mqtt-client-master at version 0.8.2 Using core 'esp8266' from platform in folder: esp8266\hardware\esp8266\2.4.0

JJ

mwildbolz commented 6 years ago

Same Problem on my side. WiFi disconnects and reconnects, but MQTT hangs in a reconnect loop (led blinking fast). Is there anything planned to fix this at the moment??

timpur commented 6 years ago

The issue is odd, i need to swap over to mqtt to have a look and see if i can find the issue, but atm im pushed for time....

philomatic commented 6 years ago

I also have this problem.

Wifi reconnect works (homie clients are pingable), but MQTT won't come up again. This does not happen always.

A second power cycle on the wifi AP (in my case a RPi, so a simple ifdown wlan0 && ifup wlan0) forces the homie clients to come back (in MQTT).

Best regards

AlexValois commented 5 years ago

I do experience the exact same issue (led fast blinking and data no more sent to MQTT broker). How do you recommend to fix it ? I am using Homie 2.0.0 Dev release. Is it fixed in 3.0.1 ?

Thanks

euphi commented 5 years ago

There is an issue, where MQTT reconnects, but the LED is not switched off (so continue in "fast blinking".) This has been fixed in V3 branch.

So check, If there is really no MQTT traffic anymore. If there is no traffic: Do you use a hostname for MQTT or the IP address?

AlexValois commented 5 years ago

There is an issue, where MQTT reconnects, but the LED is not switched off (so continue in "fast blinking".) This has been fixed in V3 branch.

So check, If there is really no MQTT traffic anymore. If there is no traffic: Do you use a hostname for MQTT or the IP address?

Do you mean does my Homie client configuration is using an IP or a hostname to connect to MQTT? If so, my clients are using the IP. No FQDN used to connect.

I have seen that I can ping my Homie clients but they were no more sending data to MQTT when their led was blinking fast.

This morning, I have changed the source code of BootNormal.cpp file as follow _void BootNormal::_mqttConnect() { if (!Interface::get().disable) { if (Interface::get().led.enabled) Interface::get().getBlinker().start(LED_MQTT_DELAY); _mqttConnectNotified = false; Interface::get().getLogger() << F("Γ’~F~U Attempting to connect to MQTT...") << endl; Interface::get().getMqttClient().connect(); } }_ I found this trick here As of now, it seems OK for my Homie clients since last 9 hours.

euphi commented 5 years ago

See #545 :

Fixed in 2.0.1-pre1 and #develop-v3