peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License

Problem with Wifi disconnecting every few minutes with Reason 8 ASSOC_LEAVE on ESP32 #51

Closed: solarjoe closed this issue 3 years ago

solarjoe commented 3 years ago

I have a problem where the WiFi connection of the ESP32 (M5 Atom) is lost every few minutes. I read a lot of posts in the Espressif forum and the ESP32 forum and tried several things (DHCP lease, deactivating AP_IF, ...) but no success so far.

The reason stated in the console is

wifi: STA_DISCONNECTED, reason:8

which is

ASSOC_LEAVE
Disassociated, because the sending STA is leaving (or has left) BSS.

For debugging I added some things to the wifi_coro callback

async def wifi_han(state):

    log.debug('wifi_han called with state {0}'.format(state))
    log.debug('wlan isconnected {0}, mqtt isconnected {1}'.format(wlan.isconnected(), client.isconnected()))

    if state:
        log.info('WiFi is up.')
    else:
        log.info('WiFi is down (outages: {0})'.format(outages))

And the log right before the connection loss always looks like this:

wifi_han called with state False
wlan isconnected True, mqtt isconnected False
WiFi is down
wifi:state: run -> init (0)
wifi:pm stop, total sleep time: 91068370 us / 97278642 us
wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
wifi: STA_DISCONNECTED, reason:8
MOD: check status network.STAT_CONNECTING, sleep

Afterwards there is a normal reconnect.

Now that looks strange to me, because the callback is called with False in _reconnect().

And calling the isconnected() functions for WiFi (wlan.isconnected()) and the MQTT client (client.isconnected()) in that callback gives me

wlan isconnected True, mqtt isconnected False

This means that MQTT is trying to reconnect because it thinks it has lost the network connection, but when I ask the WiFi function it still reports that the network connection is alive.

I have no clue how to interpret this. Could it be that MQTT has no real reason to think the connection was lost, but triggers a reconnect anyway?

Everything I found so far indicates that the ESP32 is disconnecting itself, not being kicked out by the router. The only entries in the router log show the ESP32 disconnecting normally and connecting again. Many other devices run fine on that router, e.g. an ESP8266 and some Raspberries.

Any ideas?

peterhinch commented 3 years ago

I haven't observed this behaviour. Before trying to debug this I suggest going back to basics: it's saved me time in the past.

In my own location a connection tends to drop about once an hour. Once every few minutes sounds excessive. The fact that other devices - notably an ESP8266 - work does indeed exonerate the router, which leads me to query the ESP32.

Have you checked for possible hardware issues? Voltage drops when a device uses WiFi are a very common issue. I'd also consider running some trivial code which tries to keep a connection open. Either run official mqtt.simple, or, simpler still, just create a socket and regularly send a line from the ESP to some other device and see how long the link stays up. The ESP32s I've tested have been OK (I may have been lucky), but I have come across atrocious ESP8266s.
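
A minimal sketch of that kind of socket soak test, assuming made-up WiFi credentials and a listener (e.g. `nc -lk 9000`) on another machine; the SSID, password, host and port are placeholders:

# Trivial link soak test: connect to WiFi, open a TCP socket and send a
# line every few seconds. If the WiFi drops, send() raises OSError.
import network, socket, time

SSID, PASSWORD = 'my-ssid', 'my-password'   # placeholders
HOST, PORT = '192.168.2.100', 9000          # placeholder listener, e.g. `nc -lk 9000`

wlan = network.WLAN(network.STA_IF)
wlan.active(True)
wlan.connect(SSID, PASSWORD)
while not wlan.isconnected():
    time.sleep(1)

s = socket.socket()
s.connect((HOST, PORT))
n = 0
while True:
    s.send('alive {}\n'.format(n).encode())
    n += 1
    time.sleep(5)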

solarjoe commented 3 years ago

Atrocious ESP8266s regarding mqtt_as or in general? :)

I already tried two different ESP32 (M5 Atom Lite), same issue, both drop out in a similar way.

Voltage drops were also on my list, one of the ESP32 was running from a non-powered USB2.0 hub together with a mouse and a webcam, but changing to a default cell phone power supply (5V/2A) did not change a thing.

I was running the same code with vanilla mqtt.robust before and changed to mqtt_as because it did not recover when the router was gone for a few minutes (firmware update, etc.). I did not notice this behaviour with mqtt.robust, but I will try mqtt.simple as you suggested.

Since I made some other changes to the code, I will put mqtt.simple back in for testing. Do you have a simple example with a socket I could use?

peterhinch commented 3 years ago

Atrocious ESP8266s

ESP8266 boards with major connectivity problems, regardless of the code being run. Some suffered outages two metres away from the AP.

If mqtt.robust did not show this behaviour then I'm completely baffled. There is probably little point in running .simple or any other tests as the robust test seems to have demonstrated that the WiFi link is stable.

All I can say is that this is the first report I've had of this behaviour.

solarjoe commented 3 years ago

Ok, I replaced mqtt_as with umqtt.simple and now have two ESP32s running in parallel with the different libraries.

And the one with mqtt_as is disconnecting all the time.

The code is "subscribe only", and at first I forgot to add a ping() with umqtt.simple; the behaviour looked the same as with mqtt_as. But after adding the ping() it is running fine now.

Could it be that mqtt_as somehow forgets to keep the connection alive? Will try that later with manual ping().
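
For reference, a sketch of such a subscribe-only umqtt.simple loop with a manual ping() added; the broker address, client id and topic are made up:

# Subscribe-only umqtt.simple client. Without the periodic ping() the
# broker times the client out once the keepalive period (60 s) is exceeded.
import time
from umqtt.simple import MQTTClient

def on_msg(topic, msg):
    print(topic, msg)

client = MQTTClient('esp32-test', '192.168.2.100', keepalive=60)  # placeholders
client.set_callback(on_msg)
client.connect()
client.subscribe(b'test/topic')

last_ping = time.time()
while True:
    client.check_msg()                      # non-blocking poll for messages
    if time.time() - last_ping > 30:        # stay well inside the keepalive
        client.ping()
        last_ping = time.time()
    time.sleep(1)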

solarjoe commented 3 years ago

Could there be a scenario where the keepalive function will kill the connection?

async def _keep_alive(self):
    while self.isconnected():
        ....
    self._reconnect()

mqtt_as will disconnect the network interface only here:

async def _keep_connected(self):
    while self._has_connected:
        if self.isconnected():  # Pause for 1 second
             ....       
        else:
            self._sta_if.disconnect()

Could there be a weird scenario where mqtt_as forgets to ping(), loses the connection to the broker, then thinks the WiFi is also down and actually disconnects it?

solarjoe commented 3 years ago

I added some dprint() lines to mqtt_as.py, e.g. right before the only place where mqtt_as could disconnect the network:

async def _keep_connected(self):
        while self._has_connected:
            if self.isconnected():  # Pause for 1 second
                await asyncio.sleep(1)
                gc.collect()
            else:
                self.dprint('mqtt_as will disconnect sta_if now!')
                self._sta_if.disconnect()
                await asyncio.sleep(1)

and indeed it looks like mqtt_as is disconnecting the network itself. This is the console output right before the connection is lost:

wlan isconnected True, mqtt isconnected True (outages: 45) # everything fine here, not sure what happens
_reconnect? True # but the function _reconnect() is called with True path
wlan isconnected True, mqtt isconnected False # now the network seems to be available but the broker not?
WiFi is down (outages: 46)
mqtt_as will disconnect sta_if now!
wifi:state: run -> init (0)
wifi:pm stop, total sleep time: 79904681 us / 95229870 us
wifi:new:<6,0>, old:<6,1>, ap:<255,255>, sta:<6,1>, prof:1
wifi: STA_DISCONNECTED, reason:8

Could this really be the case or am I missing something? In which cases is mqtt_as supposed to disconnect?

Looking at the source of _keep_connected(), this sounds to me like it will disconnect the network interface if the broker is gone or if mqtt_as thinks the client is disconnected.

But why not just try to reconnect to the broker instead of killing the complete network connection? Is that supposed to be something like a clean-slate approach? Or maybe I am just not getting the idea.

kevinkk525 commented 3 years ago

What are your ping interval settings? What ping interval were you using with mqtt.simple/robust? What does the mqtt broker have in its log?

solarjoe commented 3 years ago

ping_interval is 0, the normal keepalive in mqtt.simple/robust/mqtt_as is usually 60 or 120.

The broker is Mosquitto on a Raspberry Pi and the log is full of these entries:

1610573855: New connection from 192.168.2.135 on port 1883.
1610573855: New client connected from 192.168.2.135 as 5002919232a0 (c1, k60).
1610573945: Client 5002919232a0 has exceeded timeout, disconnecting.
1610573945: Socket error on client 5002919232a0, disconnecting.
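
(For reference: per the MQTT spec a broker drops a client from which it receives no control packet within 1.5 x keepalive. With k60 that is 90 s, which matches the gap in the log above: 1610573945 - 1610573855 = 90 s.)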

kevinkk525 commented 3 years ago

Hmm it looks to me like your device gets disconnected from the broker. Do you use the same config on your working esp8266? A ping_interval of 0 actually results in no ping in mqtt_as. And if your broker expects a ping, it'll disconnect the client with a timeout.

Looking at the source of _keep_connected(), this sounds to me like it will disconnect the network interface if the broker is gone or if mqtt_as thinks the client is disconnected. But why not just try to reconnect to the broker instead of killing the complete network connection? Is that supposed to be something like a clean-slate approach? Or maybe I am just not getting the idea.

It's more resilient this way because sometimes the wireless connection can cause trouble, and disconnecting the WiFi makes sure you either get a working connection again or keep trying. Personally I can't say if it's needed; I never tried it without that.

solarjoe commented 3 years ago

ping_interval of 0 means that keepalive will take over (https://github.com/peterhinch/micropython-mqtt/blob/master/mqtt_as/mqtt_as.py#L457)

The working ESP8266 uses a rather old but similar code base. I just saw that it uses umqtt.simple. It has been running like a charm for maybe three years; I never saw it in the router logs. The code is rather resilient, it will reboot on most exceptions :) But it's the same broker, keepalive 60, publish-only, so no need to ping. It just reads a pin every 30 seconds and publishes the voltage level.
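
A sketch of that kind of publish-only node (the ADC channel, topic and broker address are made up); since it publishes every 30 s, each PUBLISH is itself a control packet, so a 60 s keepalive is never exceeded and no ping is needed:

# Publish-only node: read the ADC every 30 s and publish the value with QoS 0.
# The regular PUBLISH packets keep the 60 s keepalive satisfied on their own.
import time
from machine import ADC
from umqtt.simple import MQTTClient

adc = ADC(0)                                                        # ESP8266 ADC pin (placeholder)
client = MQTTClient('esp8266-volt', '192.168.2.100', keepalive=60)  # placeholders
client.connect()

while True:
    client.publish(b'sensors/voltage', str(adc.read()).encode())
    time.sleep(30)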

kevinkk525 commented 3 years ago

Hmm, you are right, ping_interval defaults to keepalive // 4... the ping interval was just a common cause for disconnection. Now I don't know another reason either. It works stably on my ESP32s. I just checked: I run the code with keepalive=120 and no ping_interval given. So that shouldn't be the reason.
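
Roughly, the defaulting being described works out like this (a sketch of the logic, not the library code verbatim):

# Sketch of the defaulting discussed above: with ping_interval = 0,
# mqtt_as derives the ping interval from the keepalive value.
keepalive_s = 60
ping_interval_s = 0                     # "not given"
if ping_interval_s == 0:
    ping_interval_s = keepalive_s // 4  # -> 15 s between (potential) pings
print(ping_interval_s)                  # 15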

Just another thought: you run both ESP32s in parallel but with different MQTT client IDs, right? Because that would cause them to disconnect each other. (I know, unlikely since you run multiple ESP8266s and know what you're doing, but sometimes during testing I made that mistake.)

solarjoe commented 3 years ago

The two ESP32s have different client IDs derived from machine.unique_id() and I can see that they are different in the broker log. Same for the ESP8266, but I can't subscribe to $SYS/broker/# for whatever reason to be 100 % sure. Also, shouldn't this result in immediate disconnects rather than a delay of at least a minute?

I am trying to get to the bottom of this. This is from mqtt_as.py:

async def wait_msg(self):
    res = self._sock.read(1)  # Throws OSError on WiFi fail
    ....
    if res == b'':
        raise OSError(-1)

async def _handle_msg(self):
    try:
        await self.wait_msg()  # Immediate return if no message
    except OSError:
        pass
    self._reconnect()  # Broker or WiFi fail.

The reason in my case seems to be an OSError from wait_msg() that triggers the reconnect when the socket call returns b''. But what does that mean? When does the socket return that?

The comment in the code states that the error is thrown on WiFi fail. But a call to self._sta_if.isconnected() at that point shows that the WiFi is still up; the next second it is disconnected, when _keep_connected() calls self._sta_if.disconnect(). At least the ESP32 thinks that the WiFi is still up...

solarjoe commented 3 years ago

I think I remember that the same OSError with b'' was thrown by umqtt when I forgot to ping() and exceeded the keepalive period. Does that make sense?

solarjoe commented 3 years ago

I could not figure out whether there is an MQTT control packet sent when I receive a message from a topic I subscribed to, and whether that control packet counts as a keep-alive packet.

The spec says

It is the responsibility of the Client to ensure that the interval between Control Packets being sent does not exceed the Keep Alive value. In the absence of sending any other Control Packets, the Client MUST send a PINGREQ Packet.

solarjoe commented 3 years ago

Hm, it's getting warmer...

I added a manual await client._ping() to my application code. So far no outages.

I think async def _keep_alive() does not work as expected. It relies on the value of self.last_rx to figure out whether a ping needs to be sent, and this value is updated in several places, e.g. in async def _as_read().

In my case, subscriber only, just receiving messages updates self.last_rx and _ping() is never called at all! This could be fine if there were some packet like MSGRSP that the client sends back after receiving a subscribed message and that would count as an "alive packet". But I am not sure if this is how it works...

That's the point I mentioned above - I am not sure if reading from the socket is enough to keep the connection alive. I think you need to send MQTT control packets to the server; not necessarily a PINGREQ, some others also seem to work.

So self.last_rx might be the wrong value to use; maybe something like a self.last_tx updated in async def _as_write() is needed, plus a check whether the transmitted packet counts as an "alive packet".

https://www.hivemq.com/blog/mqtt-essentials-part-10-alive-client-take-over/

From what I read here, the b'' that raises the OSError means that the broker did not fully close the socket, but will not send any more data:

In fact it is not even a full close: the peer only indicates that it will not send any more data, but it might still receive data.

peterhinch commented 3 years ago

The reason in my case seems to be an OSError from wait_msg() that triggers the reconnect when the socket call returns b''. But what does that mean? When does the socket return that?

On a nonblocking socket a call to socket.read() returns None if there is no data available. A return of b'' indicates that the connection was closed by the host, as here.
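
A sketch of those semantics on a MicroPython nonblocking socket (the broker address is a placeholder):

# Nonblocking stream read: None = no data yet, b'' = peer closed the
# connection, otherwise one byte of data. mqtt_as maps b'' to OSError(-1).
import socket

s = socket.socket()
s.connect(('192.168.2.100', 1883))   # placeholder broker
s.setblocking(False)

res = s.read(1)
if res is None:
    print('no data yet, keep polling')
elif res == b'':
    raise OSError(-1)                # connection closed by host
else:
    print('got first byte of a packet:', res)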

solarjoe commented 3 years ago

With the call to await client._ping() on the application side, the code has now been running for five hours straight without a single WiFi outage. My suspicion is that there is something off with _keep_alive() and self.last_rx.

Can someone reproduce this using a subscribe-only client? It is helpful to print the values for self.last_rx and pings_due. With my code I never hit the case elif pings_due >= 1:
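
A toy model of the check in the old _keep_alive(), showing why pings_due never reaches 1 on a busy subscribe-only client; the message rate and intervals are made up:

# Toy model of the old ping decision: if subscribed messages arrive more
# often than the ping interval, last_rx is refreshed each time and
# pings_due stays at 0, so _ping() is never sent and the broker times out.
from time import ticks_ms, ticks_diff, sleep

ping_interval_ms = 15000          # keepalive 60 s // 4
last_rx = ticks_ms()

for second in range(60):
    sleep(1)
    if second % 10 == 0:          # a subscribed message arrives every 10 s...
        last_rx = ticks_ms()      # ...and refreshes last_rx
    pings_due = ticks_diff(ticks_ms(), last_rx) // ping_interval_ms
    print(second, 'pings_due =', pings_due)   # stays 0 the whole time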

solarjoe commented 3 years ago

Some more information on what I think is happening. I have a publish-only ESP8266 that is publishing with QoS 0. The two ESP32s with the WiFi connection problems are subscribe-only with QoS 1.

The MQTT specs [MQTT-3.8.4-6] state

The Server might grant a lower maximum QoS than the subscriber requested. The QoS of Payload Messages sent in response to a Subscription MUST be the minimum of the QoS of the originally published message and the maximum QoS granted by the Server.

The MQTT spec's chapter on QoS states that with QoS 0 the broker does NOT require a response control packet from the subscribing / receiving side.

For QoS 1 the packet PUBACK is expected, for QoS 2 the packets PUBREC and PUBCOMP are expected.
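
Applied to this setup, the rule works out as:

# Effective delivery QoS = min(QoS of the original publish, granted
# subscription QoS). Publisher uses QoS 0, subscription asked for QoS 1,
# so messages arrive at QoS 0 and the subscriber never sends a PUBACK.
effective_qos = min(0, 1)   # == 0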

The situation as I understand it is:

- The ESP8266 publishes with QoS 0, so the messages delivered to my QoS 1 subscriptions are downgraded to QoS 0.
- With QoS 0 the subscriber sends no PUBACK (or any other control packet) back to the broker.
- Incoming messages keep updating self.last_rx, so _keep_alive() never decides that a ping is due and _ping() is never sent.
- The broker therefore never receives a control packet from the client, hits its keepalive timeout and closes the connection.

It is easy to miss this case, because what keeps many clients alive is not the function async def _keep_alive() in mqtt_as, as one would expect. Instead they are kept alive only by the response packets to messages that were published by other devices with QoS > 0.

So it's basically not the client that keeps itself alive, as it should be, but merely some other device's choice of QoS.

Good theory?

peterhinch commented 3 years ago

Yes, I think you have cracked it. Well spotted. We tested the subscribe-only case, but not this scenario. Perhaps the code should read:

    async def _keep_alive(self):
        while self.isconnected():
            pings_due = ticks_diff(ticks_ms(), self.last_rx) // self._ping_interval
            if pings_due >= 4:  # Ping responses missed
                self.dprint('Reconnect: broker fail.')
                break
            await asyncio.sleep_ms(self._ping_interval)
            try:
                await self._ping()
            except OSError:
                break
        self._reconnect()  # Broker or WiFi fail.

In other words we ping at the ping interval regardless of received messages.

Any thoughts, @kevinkk525 @solarjoe ?

kevinkk525 commented 3 years ago

oh.. I never went deep enough into mqtt to realize that my qos=1 subscriptions are actually downgraded to qos=0 if the message was published with qos=0.. Now I had to check my code again :D But I use qos=1 everywhere in my project so I guess I never actually came across the downgrading of qos.

So last_rx is only to be updated when sending back a response to the broker, not when receiving data. Did I understand that correctly? If so, it could be updated every time we send back a PUBACK. But personally I don't think such optimizations are necessary. An additional ping message every _ping_interval should hardly make a difference for any application. You'd need extremely high throughput to have a problem with that additional short message. However, if you have such big throughput you are likely using qos=0 for publish, which then requires you to ping anyway. And if you use qos=1 for all your messages, there is already such a big overhead that the ping message won't even be noticed, as sending is a lot faster than receiving a PUBACK.

Therefore I think just sending a ping every _ping_interval as proposed by @peterhinch is the best option and all optimizations to reduce the amount of ping messages can then be removed.

peterhinch commented 3 years ago

optimizations to reduce the amount of ping messages can then be removed.

That was the purpose of the existing _keep_alive code. I now can't see a way to reduce the number of pings. Timing from the last transmission won't work because qos 0 transmissions don't get a response (pings force a response from the broker). As you say, attempting to avoid "unnecessary" pings is probably pointless in the context of throughput.

peterhinch commented 3 years ago

I've pushed this change. I tested basic functionality. @solarjoe please check in your configuration.

solarjoe commented 3 years ago

Will check later. But it looks good to me, pragmatic and clean.

Another way to solve it would be to introduce a similar approach using a time of the latest transmission, self.last_tx, and update it in async def wait_msg at the section where the PUBACK is sent back (or, in the case of the not yet implemented QoS 2, PUBREC and PUBCOMP), or any other magic control packet that counts as a sign of life. But maybe don't open that door :)

if op & 6 == 2:  # qos 1
    pkt = bytearray(b"\x40\x02\0\0")  # Send PUBACK
    struct.pack_into("!H", pkt, 2, pid)
    await self._as_write(pkt)
elif op & 6 == 4:  # qos 2 not supported
    raise OSError(-1)

peterhinch commented 3 years ago

Point taken.

But maybe don't open that door :)

Yes, let's keep it simple :)

solarjoe commented 3 years ago

Works like a charm for me, running for two hours without a single outage.

peterhinch commented 3 years ago

Excellent. Thank you for identifying this issue.