peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License
549 stars 116 forks source link

ESP32 MPv1.21.1 Wifi state goes briefly to IDLE after CONNECTING #129

Closed jbfuzier closed 5 months ago

jbfuzier commented 5 months ago

Hello,

Firstly, thanks for this great project.

I ran into an issue on an ESP32 with micropython v1.21.1. When a disconnection occured, mqtt_as tries to reconnect forever but fails. After adding some debugging, I think I have pinpointed the root cause :

During a successful connection, my ESP32 goes through the following status : CONNECTING (1001)-> IDLE (1000) -> GOT_IP (1010).

The idle status causes the break on line 603 :

image

Which triggers OSError :

image

Which is caught on line 749 and result in a infinite disconnect / reconnect cycle (line 743) :

image

Here is some debug log commented to tell where the print are placed in the code :

Calling sta_if.disconnect (line 743)
I (3879999) wifi:state: run -> init (0)
I (3880009) wifi:pm stop, total sleep time: 0 us / 10975493 us

I (3880009) wifi:<ba-del>idx
I (3880009) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (3880009) wifi: STA_DISCONNECTED, reason:8:
connecting to ssid (line 595)
I (3881029) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (3881039) wifi:state: init -> auth (b0)
I (3881039) wifi:state: auth -> assoc (0)
I (3881049) wifi:state: assoc -> run (10)
I (3881069) wifi:connected with pineapple2g, aid = 4, channel 1, BW20, bssid = d0:21:f9:4c:93:d1
I (3881069) wifi:security: WPA2-PSK, phy: bgn, rssi: -45
I (3881079) wifi:pm start, type: 0

I (3881079) network: CONNECTED
I (3881099) wifi:<ba-add>idx:0 (ifx:0, d0:21:f9:4c:93:d1), tid:0, ssn:1, winSize:64
I (3881139) wifi:AP's beacon interval = 102400 us, DTIM period = 1
Wifi status : 1000 (line 602)
wifi_connect returned exception Wi-Fi connect timed out (line 750)
Wifi link is down (Line 741)
DBG : wifi status=1000 (Added line 741 before the self._sta_if.disconnect() call, simply a 10 iteration loop which prints self._sta_if.status(). We can see here the transient IDLE state (1000))
I (3882079) network: GOT_IP
I (3882079) esp_netif_handlers: sta ip: 192.168.92.122, mask: 255.255.255.0, gw: 192.168.92.1
DBG : wifi status=1010 (Still the same  debug loop. And now we are in GOT_IP state)
DBG : wifi status=1010
DBG : wifi status=1010
DBG : wifi status=1010
DBG : wifi status=1010
DBG : wifi status=1010
DBG : wifi status=1010
DBG : wifi status=1010
DBG : wifi status=1010
Calling sta_if.disconnect

For my previous project, I was using mp 1.19 with mqtt_as and I am pretty sure there was no issue. I see several plausible explanations :

Anyway, I think the mqtt_as code should handle the worst possible scenarios and as such should take into account this ESP32 behavior.

(The fix I am using which totally solved the issue for me is not very elegant but looks like that. Arround line 602) :

image
ebolisa commented 5 months ago

Instead of using break, shouldn’t you use reset? Since you went through 10 loops with no avail.

peterhinch commented 5 months ago

On the face of it this could be fixed by changing this line to read:

if s.status() not in (network.STAT_CONNECTING, network.STAT_IDLE):
    break

However I need to check whether this occurs on the reference board running the latest firmware (V1.22.1). I have never seen this behaviour in the past.

peterhinch commented 5 months ago

I can't replicate this. The following is a traceback of the range_ex.py demo over a WiFi outage. Hardware is the ESP32 reference board running 1.22.1, and recovery is normal:

Topic: "foo_topic" Message: "gordon bennett" Retained: False
publish 25
RAM free 128400 alloc 24560
publish 26
WiFi or broker is down.
Checking WiFi integrity.
Got reliable connection
Connecting to broker.
Connected to broker.
Reconnect OK!
We are connected to broker.
Topic: "foo_topic" Message: "pete was here" Retained: False
jbfuzier commented 5 months ago

On the face of it this could be fixed by changing this line to read:

if s.status() not in (network.STAT_CONNECTING, network.STAT_IDLE):
    break

However I need to check whether this occurs on the reference board running the latest firmware (V1.22.1). I have never seen this behaviour in the past.

Good point, much simpler and makes more sense this way.

I am building the same project on another board, I will check if I have the same behavior with this board. Maybe the 1st board is not good.

jbfuzier commented 5 months ago

I build the same project with two other board, same code (but esp32 S3 instead of esp32) and did not run into the same issue.

Looking back at the first board, it is working but is experiencing much more deconnection / instability than others. So there is likely something wrong with the board. Howerver it is still kind of working with the fix.

peterhinch commented 5 months ago

OK, thanks for letting me know. I think we've resolved this one.

peterhinch commented 4 months ago

Others have reported problems with firmware V1.20 and later (e.g. #132 and #133). We have figure out the reason and I have pushed an update which addresses this.