marvinroger / async-mqtt-client

📶 An Arduino for ESP8266 asynchronous MQTT client implementation
MIT License

Client stuck on MQTT reconnect #248

Closed rousveiga closed 3 years ago

rousveiga commented 3 years ago

Hello! Sometimes, my device disconnects from the MQTT broker and never reconnects again while the sketch is running. If I reset the device, it can connect again without any problems.

I have a MWE and extensive logs. The logfile is pretty big, so I separated the part from when it failed: https://pastebin.com/SPdQK3z5

The #include "M5StickPlus.h" and M5.begin(false, true, true); in the MWE are there because I'm using an M5 Stick-C Plus for testing, which has an ESP32-PICO-D4 inside.

I can provide more info if necessary. Thanks in advance!

GioTB commented 3 years ago

I have a similar problem: a few times my ESP32 has disconnected from the broker and I had to reboot it to re-establish the connection. I wasn't sure whether this was a library bug or my own.

rousveiga commented 3 years ago

a few times my ESP32 has disconnected from the broker and I had to reboot it to re-establish the connection

@GioTB That's exactly the same behavior I get. Do your logs look like mine, i.e. do they show the client getting stuck in the reconnecting stage?

proddy commented 3 years ago

I have a similar occurrence: if the broker is down for a while and then comes back up (say >5 min during a server upgrade), AsyncMqttClient fails to reconnect. I'll see if I can create some small test code to reproduce it.

luebbe commented 3 years ago

If I recall @rousveiga's other post correctly, you are already working with the develop branch? So the current recommendation, "please try with develop and see if it works", doesn't help here...

rousveiga commented 3 years ago

Yes, I'm using develop indeed.

GioTB commented 3 years ago

@rousveiga My implementation is different from yours, but it does basically the same thing (I implemented the reconnection with a FreeRTOS timer). Each time the "onDisconnect" callback is called, it attempts a reconnect and leaves the timer running until it connects (and yes, it does get stuck trying to reconnect). Since the device is far away from me, I had to implement a reboot of the ESP32 in case too much time passes and it can't connect. The error is not frequent at all: for me it has happened 3-4 times in a period of about 4 months. One thing to note is that my broker (I use flespi.io) says that the device "connects and disconnects" several times, which is odd.

GioTB commented 3 years ago

@luebbe I just realized I was using the master branch. I'm using PlatformIO with the "ottowinter" repo; now I'm going to try the develop branch of this repo. By the way, what is the difference between ottowinter's repo and this one?

luebbe commented 3 years ago

@GioTB This is the original, where @bertmelis has made a lot of fixes to the develop branch in the past months. I haven't looked at ottowinter's fork, but I guess he has also tried to fix some of the bugs there.

cyber-junkie9 commented 3 years ago

@luebbe I just realized I was using the master branch

Any update? I have multiple nodes, and most of the time I end up restarting the router to fix it :/

GioTB commented 3 years ago

@cyber-junkie9 I can't tell you yet if it works; at least so far I haven't lost the connection. But there is a quick patch you can implement in your code so you don't have to reboot your nodes manually: add a counter, so that if the number of reconnection attempts exceeds a certain limit, the node restarts. I had to implement that in my code so I wouldn't lose the device forever (my main device is about 2 hours away from me).
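
A minimal sketch of such a counter, written as plain C++ so it can be tried off-device. The class name `ReconnectWatchdog` and its methods are hypothetical and not part of async-mqtt-client; the restart action is passed in as a callback, which on an ESP32 or ESP8266 would typically wrap `ESP.restart()`. Note that it only helps if something (for example a timer) keeps starting new attempts.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper (not part of async-mqtt-client): counts reconnection
// attempts and triggers a restart action once a limit is reached.
class ReconnectWatchdog {
 public:
  ReconnectWatchdog(uint32_t maxAttempts, std::function<void()> restart)
      : maxAttempts_(maxAttempts), restart_(std::move(restart)) {
    assert(maxAttempts_ > 0);  // a limit of zero would make no sense
  }

  // Call each time a reconnection attempt is started (e.g. from the timer).
  void onAttempt() {
    if (++attempts_ >= maxAttempts_) {
      attempts_ = 0;
      restart_();  // on an ESP this would typically call ESP.restart()
    }
  }

  // Call from the onConnect handler: a successful connection resets the count.
  void onConnected() { attempts_ = 0; }

  uint32_t attempts() const { return attempts_; }

 private:
  uint32_t maxAttempts_;
  std::function<void()> restart_;
  uint32_t attempts_ = 0;
};
```

In a sketch, `onAttempt()` would sit right before the call to `connect()`, and `onConnected()` inside the onConnect callback.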

rousveiga commented 3 years ago

add a counter, so that if the number of reconnection attempts exceeds a certain limit, the node restarts. I had to implement that in my code so I wouldn't lose the device forever (my main device is about 2 hours away from me)

@GioTB Might do this myself, my devices are away from me as well.

bertmelis commented 3 years ago

Is everybody who experiences this using an ESP32, or does it also occur on ESP8266?

proddy commented 3 years ago

I wrote a mini app to try and recreate the problem, but couldn't anymore!

luebbe commented 3 years ago

I'm using only ESP8266 and I have one device that sometimes has MQTT reconnection issues. Admittedly I was too lazy to dig deeper, since power cycling it usually solves the issue.

GioTB commented 3 years ago

@bertmelis I'm only using ESP32

rousveiga commented 3 years ago

@GioTB I just realized something. From your comments, I infer that you get your onDisconnect handler called several times; as in, the device attempts to reconnect, it fails, and this loop repeats forever unless you reboot. And the fix you mention can be implemented in the sketch side of things, rather than patching the library. Am I correct?

For me, the onDisconnect handler only gets called once: I call connect again, and that single reconnection attempt then hangs forever unless I reboot.

bertmelis commented 3 years ago

I'd love to see a wireshark replay of this.

It is also not clear to me which device actually initiates the continuous disconnections: the client or the broker.

rousveiga commented 3 years ago

I'd love to see a wireshark replay of this.

I'll try to provide it.

It is also not clear to me which device actually initiates the continuous disconnections: the client or the broker.

It's not clear to me either. I'm experiencing this behavior with the Mosquitto add-on for Home Assistant; the logs I have access to get cut off after a certain point, and I haven't yet found a way to get the full logs. I will try running my MWE with a different broker.

Another thing I want to do is to get some logs of the AsyncTCP side of things, to see if it helps the investigation.

luebbe commented 3 years ago

@rousveiga I remember reading something (probably on heise online, a German computer magazine, but I can't find the article) that there was a problem with a recent update to the mosquitto add-on for HA. People had to revert to a previous version.

I'm using HA, but my mosquitto runs on a separate raspberry.

rousveiga commented 3 years ago

@luebbe That's interesting. I'll look it up. Thanks!

@bertmelis I just remembered, even though I haven't confirmed it 100%, that my other MQTT devices (running Espurna) seem to work fine, so at first I thought it would be a client issue.

bertmelis commented 3 years ago

Does Espurna use pubsubclient or also this lib? I had the impression it uses this.

GioTB commented 3 years ago

@GioTB I just realized something. From your comments, I infer that you get your onDisconnect handler called several times; as in, the device attempts to reconnect, it fails, and this loop repeats forever unless you reboot. And the fix you mention can be implemented in the sketch side of things, rather than patching the library. Am I correct?

For me, the onDisconnect handler only gets called once: I call connect again, and that single reconnection attempt then hangs forever unless I reboot.

@rousveiga Yes, precisely that! So our reconnection loops are different. Another thing I do is activate a timer which attempts to reconnect every 5 seconds, so "connect" is called several times. On the device I have in the field, it sometimes takes up to 14 tries until it reconnects, but this could be due to the internet connection or some other problem; I can't say it's caused by the library. The most important thing is that after a while it manages to connect back to the broker. It's worth noting that since I changed to the develop branch I haven't had the issue anymore. I even set up an ESP32 that disconnects every 20 seconds so that it would attempt the reconnect on its own, and it always reconnects with no problem.

bertmelis commented 3 years ago

For reference: only try to reconnect if the previous attempt has failed. The broker disconnects the oldest client when a new one with the same ID tries to connect.

This reconnect loop could be a timing issue.
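
The "only reconnect if the previous attempt has failed" rule can be sketched as a small gate that a periodic timer consults before calling `connect()`. This is an illustration under assumptions: `ReconnectGate` and its method names are hypothetical, and it assumes every `connect()` call in the sketch goes through `tryStart()`.

```cpp
#include <cassert>

// Hypothetical helper: a periodic timer may call tryStart() as often as it
// likes, but a new connect() is only allowed once the previous attempt ended.
class ReconnectGate {
 public:
  // Returns true if the caller should call connect() now.
  bool tryStart() {
    if (state_ != State::Idle) return false;  // attempt pending or connected
    state_ = State::Connecting;
    return true;
  }

  void onConnected() { state_ = State::Connected; }  // from onConnect
  void onDisconnected() { state_ = State::Idle; }    // from onDisconnect

 private:
  enum class State { Idle, Connecting, Connected };
  State state_ = State::Idle;
};

// Usage walk-through: what a 5-second reconnect timer would see.
inline void reconnectGateExample() {
  ReconnectGate gate;
  const bool first = gate.tryStart();   // tick 1: caller should connect()
  const bool second = gate.tryStart();  // tick 2: attempt still pending
  gate.onDisconnected();                // attempt failed or timed out
  const bool retry = gate.tryStart();   // tick 3: a retry is allowed
  assert(first && !second && retry);
  (void)first; (void)second; (void)retry;  // silence warnings under NDEBUG
}
```

The gate avoids opening a second session with the same client ID while the first attempt is still pending, which is the situation in which a broker would drop the older connection.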

GioTB commented 3 years ago

@bertmelis Thanks! One question: the "onDisconnect" callback is called each time the client fails to connect (and obviously also when a successful connection drops), right? The point of my question is that I guess I could call the "connect" method only when "onDisconnect" is triggered. Currently I'm calling connect every 5 seconds, independently of the "onDisconnect" callback (I do this with a FreeRTOS timer that is only stopped once the "onConnect" callback is called).

bertmelis commented 3 years ago

@GioTB Yes, onDisconnect is called every time. That includes, for example, when the client can't connect and the connection attempt times out (the AsyncTCP lib does this).

rousveiga commented 3 years ago

Does Espurna use pubsubclient or also this lib? I had the impression it uses this.

@bertmelis I checked it out and, while you can configure the use of other libraries, the default is this one, yes.

I looked into the issues and found people experiencing the same problem: https://github.com/xoseperez/espurna/issues/2365, https://github.com/xoseperez/espurna/issues/2112.

Looks like Espurna maintainers have their own fork and fix: https://github.com/mcspr/async-mqtt-client/commit/c1fcfd1. I'm going to try to apply this patch and see if it works.

The point of my question is that I guess I could call the "connect" method only when "onDisconnect" is triggered.

@GioTB Yes, that's exactly what I do.

cyber-junkie9 commented 3 years ago

I'm going to try to apply this patch and see if it works.

Did it work?

rousveiga commented 3 years ago

@cyber-junkie9 So far, it looks like it - I have three devices working since Thursday - but I'm going to keep monitoring them for a few more days just to be sure.

bertmelis commented 3 years ago

There is indeed a flaw: when the TCP connection is made, the MQTT ping system is not working yet. Since the TCP connection is already established, there is nothing left to time out.

This fix will indeed solve that, I think. Not sure, though, why a broker would stop communicating between accepting the TCP connection and the CONNECT packet.

EDIT: there might be 2 issues: one with a single reconnect that gets stuck and one with a continuous connect/reconnect loop. I'm talking about the single event here.

EDIT2: I might be mistaken, working from a smartphone screen...
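
The gap described in this comment (TCP is connected, but nothing can time out until MQTT pings start) can be modelled with a simple deadline on the handshake. This is a sketch of the idea only, not the code of the referenced patch: `HandshakeDeadline` and its methods are hypothetical names, and timestamps are assumed to be milliseconds from a wrapping 32-bit counter such as Arduino's `millis()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: after the TCP connection is up, wait a bounded time
// for the broker's CONNACK instead of waiting forever.
class HandshakeDeadline {
 public:
  explicit HandshakeDeadline(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {
    assert(timeoutMs_ > 0);
  }

  void onTcpConnected(uint32_t nowMs) {  // TCP is up, CONNECT was sent
    startMs_ = nowMs;
    waiting_ = true;
  }
  void onConnack() { waiting_ = false; }  // broker answered, pings take over
  void onClosed() { waiting_ = false; }   // socket closed for any reason

  // Poll periodically. True means: close the socket so that the normal
  // disconnect handling (and a later reconnect) can run.
  bool expired(uint32_t nowMs) const {
    // Unsigned subtraction stays correct across counter wraparound.
    return waiting_ && (nowMs - startMs_) >= timeoutMs_;
  }

 private:
  uint32_t timeoutMs_;
  uint32_t startMs_ = 0;
  bool waiting_ = false;
};
```

With a deadline like this, a broker that accepts the TCP connection but never answers no longer leaves the client waiting indefinitely.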

rousveiga commented 3 years ago

It's been a week without a trace of this issue, so I'd say the fix did indeed work.

Here's my patched file. I tried to add the setRxTimeout invocations at the same points, but the library seems to have been refactored since it was forked, so it might be a bit off.

bertmelis commented 3 years ago

It's been a week without a trace of this issue, so I'd say the fix did indeed work.

Here's my patched file. I tried to add the setRxTimeout invocations at the same points, but the library seems to have been refactored since it was forked, so it might be a bit off.

You're welcome to create a PR (preferably develop branch). I can merge from my poolside lounger on my phone.

rousveiga commented 3 years ago

You're welcome to create a PR (preferably develop branch).

Okay, I will!

I can merge from my poolside lounger on my phone.

Enjoy! 😄

GioTB commented 3 years ago

@rousveiga @bertmelis great work! thanks!

bertmelis commented 3 years ago

Issue closed by merging the PR.