knolleary / pubsubclient

A client library for the Arduino Ethernet Shield that provides support for MQTT.
http://pubsubclient.knolleary.net/
MIT License
3.83k stars 1.47k forks source link

connect function stucks when network is lost #444

Open sysizlayan opened 6 years ago

sysizlayan commented 6 years ago

Hi, I was testing a case where network is lost on my esp32 project. While I was testing I simply turn off the router then open it after some time.

The problem occurs when a mqtt client tries to connect. I turned off the network just before the mqtt client tries to connect. The blocked segment is given below:

if (WiFi.isConnected())
   {
    WiFiClient _mqttPubClient;

    PubSubClient mqttPubClient(mqttServer, mqttPort, mqttPubCallback, _mqttPubClient);
    // WIFI connection is cut
    // WORKS UP TO HERE!
    if (mqttPubClient.connect("MyPubClient"))
    {
      // DOES NOT ENTER HERE, DEBUG LOGS SHOWS STA_DISCONNECTED here
      DEBUG_UART.printf("INFO [PERIODIC_STATE_UPDATER] State publisher connected!\r\n\r\n");

      int publishStatus = mqttPubClient.publish("DEVICE_STATE", jsonState.c_str());
      DEBUG_UART.printf("INFO [PERIODIC_STATE_UPDATER] Mqtt publish status: %d\r\n\r\n", publishStatus);
      if (publishStatus == 1)
        return true;
      else
        return false;
    }
    else
    {
     // DOES NOT ENTER HERE TO
      DEBUG_UART.printf("ERROR [PERIODIC_STATE_UPDATER] Mqtt Publisher cannot connect to the server!");
      return false;
    }
    mqttPubClient.disconnect();
  }
  else
    return false;

The program remains stuck until the watchdog is triggered. I cannot solve the problem.

Please check if there is a timeout or not for that. Thanks in advance :)

tsathishkumar commented 6 years ago

I can confirm the same behaviour. We were noticing our esp32s go into a zombie state if we keep it running for a day or two. Found the issue to be pubsub client connect hangs whenever the wifi connection is lost. I was about to create a new issue but found it is already created :)

tsathishkumar commented 6 years ago

To be more specific, it hangs exactly on client connect method call _client->connect(this->domain, this->port); or _client->connect(this->ip, this->port).

PubSubClient.cpp:121

or

PubSubClient.cpp:123

tsathishkumar commented 6 years ago

Thinking about it, may be not related to pubsubclient library. We may need to fix it at esp32 wifi client level.

tsathishkumar commented 6 years ago

I did some alteration to my code, to listen to WiFiEvents and handling disconnect there. This resolved the issue. Once the WiFi connection is back, pubsubclient resumes the connection and able to connect to server.

Sample WiFiEvent handling for reference.

#include <WiFi.h>
const char* ssid="ssid"
const char* password="password"
static volatile bool wifi_connected = false;

void WiFiEvent(WiFiEvent_t event)
{
  switch (event)
  {
  case SYSTEM_EVENT_AP_START:
    //can set ap hostname here
    WiFi.softAPsetHostname(_node_name.c_str());
    //enable ap ipv6 here
    WiFi.softAPenableIpV6();
    break;

  case SYSTEM_EVENT_STA_START:
    //set sta hostname here
    WiFi.setHostname(_node_name.c_str());
    break;
  case SYSTEM_EVENT_STA_CONNECTED:
    //enable sta ipv6 here
    WiFi.enableIpV6();
    break;
  case SYSTEM_EVENT_AP_STA_GOT_IP6:
    //both interfaces get the same event
    Serial.print("STA IPv6: ");
    Serial.println(WiFi.localIPv6());
    Serial.print("AP IPv6: ");
    Serial.println(WiFi.softAPIPv6());
    break;
  case SYSTEM_EVENT_STA_GOT_IP:
    Serial.println("WiFi connected");
    Serial.println("IP address: ");
    Serial.println(WiFi.localIP());
    wifi_connected = true;
    break;
  case SYSTEM_EVENT_STA_DISCONNECTED:
    wifi_connected = false;
    Serial.println("STA Disconnected");
    delay(1000);
    WiFi.begin(ssid, password);
    break;
  }
}

void setup()
{
  WiFi.onEvent(WiFiEvent);
  WiFi.begin(ssid, password);
}

void loop() {
  if(wifi_connected) {
    wifi_connected_loop();
  } else {
    wifi_not_connected_loop();
  }
}

void wifi_connected_loop() {
//Do something here
}

void wifi_not_connected_loop() {
//Do something else here
}

This issue can be closed and may be the sample code can be updated with WiFiEvents rather than blocking while loop.

knolleary commented 6 years ago

@tsathishkumar that looks really good - I wasn't aware of the WifiEvents stuff. If you wanted to open a pull-request with a suitable update to the example that would be excellent.

tsathishkumar commented 6 years ago

Sure, I'll do that.

sysizlayan commented 6 years ago

My problem cannot be solved by event action because event is fired after connect function of the library is called. Only way to handle that way may be deleting the main loop thread and creating that again, it is no different than restarting the esp32. Do you think is it because of the client class written in esp32/arduino core?

tsathishkumar commented 6 years ago

Actually it can be solved with events. I had tested it. Because once the event thread successfully establishes the connection, other thread will continue working fine.

Can you test and update here please?

On Mon, Jun 18, 2018, 8:52 PM sysizlayan notifications@github.com wrote:

My problem cannot be solved by event action because event is fired after connect function of the library is called. Only way to handle that way may be deleting the main loop thread and creating that again, it is no different than restarting the esp32. Do you think is it because of the client class written in esp32/arduino core?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/knolleary/pubsubclient/issues/444#issuecomment-398092342, or mute the thread https://github.com/notifications/unsubscribe-auth/ABaM5FUlvhrI1ve8SA7tbMom_s7dumPiks5t98W6gaJpZM4UN7wG .

tanmaysathe commented 6 years ago

@tsathishkumar I think we can safely remove: delay(1000); WiFi.begin(ssid, password); from SYSTEM_EVENT_STA_DISCONNECTED event. WiFi.begin is required only if ssid & password needs to be changed.

tsathishkumar commented 6 years ago

@tanmaysathe But, does it automatically try to connect again?

tanmaysathe commented 6 years ago

Once wifi connection is made, the chip stores SSID & password in the flash. It automatically reconnects to last used access point without any need of calling WiFi.begin function thereafter. https://arduino-https://arduino-esp8266.readthedocs.io/en/latest/esp8266wifi/readme.html#quick-start

tsathishkumar commented 6 years ago

I remember reading an issue where the auto reconnect was not working. Not sure if that was resolved. Found a reference in ESP32 site.

http://esp-idf.readthedocs.io/en/latest/api-guides/wifi.html

s6.2: In the scenario described above, the application event callback function relays to the application task. We recommend that esp_wifi_connect() be called to reconnect the Wi-Fi, close all sockets and re-create them if necessary. Refer to .

So, the recommendation is to close all open sockets, reconnect WiFi and recreate the sockets.

sysizlayan commented 6 years ago

Autoreconnect can be configured in esp32 arduino core but the support for persistent connection(Where ssid and password is saved to non-volatile memory) is still and issue. For autoconnect, one needs to first start the WiFi connection with ssid and password, those values are saved in the memory. If the connection is broken, it tries to reconnect again automatically until restart.

My issue is still going by the way @tsathishkumar. To be more clear, I first call the "connect" function of pubsubclient. Then the event is fired as STA_DISCONNECTED. During this process, the connection trial does not stop, only the event function is called. Even if I call the WiFi.begin again, when the event function returns, my code waits on the porint where I call connect. The socket or the client does not give any timeout error.

tsathishkumar commented 6 years ago

Even if I call the WiFi.begin again, when the event function returns, my code waits on the porint where I call connect. The socket or the client does not give any timeout error.

You mean, you got the SYSTEM_EVENT_STA_CONNECTED event and still the pubsub client was waiting on previous call? Because when I tested the pubsub client returned after the successful connection.

sysizlayan commented 6 years ago

I mean exactly that. In normal operation, the connect function should return a timeout and keep going but the system stucks in there. Imagine a case where the Wifi does not exists anymore, the program will remain in there forever.

Because of that it is an issue, the client does not return any timeout.

tsathishkumar commented 6 years ago

In my case, when the WiFi connected again, the function returned. I will do another test sometime later and update here.

sysizlayan commented 6 years ago

I understand and it may be possible but it shows only that there is a while loop somewhere. If the network is not available anymore, the code will remain there forever. I am currently detecting that using a watchdog timer and only recover way is restart.

What I want to understand, is the loop causing this in pubsubclient or in wifiClient?

tsathishkumar commented 6 years ago

It is in wifi client as far as I know. @knolleary please correct me if I'm wrong.

shubhamrjadhav commented 5 years ago

Im using this library in my CC3200 energia and my code is getting stuck at if (!client.connect(flash_buffer, "xxx", "xxx")) function, i have done all changes given above but still no effect my device is getting hang at that particular point if disconnected

if (!client.connected()) { //Serial.println("Disconnected. Reconnecting...."); led_indication_data = 0x01; master_led_config();

if (!client.connect(flash_buffer, "xxx", "xxx"))
{
  led_indication_data = 0x01;
  master_led_config();
  //Serial.println("Connection failed");
}
else
{
  led_indication_data = 0x02;
  master_led_config();
  //Serial.println("Connection success");
  if (client.subscribe(inTopic))
  {
    //Serial.println("Subscription successfull");
  }
}

}

rohitjust24idpl commented 5 years ago

There is one more case here.

  1. WiFi is available but internet is not there, still it gets hanged for few seconds.

Any workaround for this ?

cranton commented 5 years ago

Can this be the problem? PubSub uses the connect methods without timeout from the WiFi library. Internally the client uses -1 if (domain != NULL) { result = _client->connect(this->domain, this->port); } else { result = _client->connect(this->ip, this-> } maybe adding the timeout here resolves the problem if (domain != NULL) { result = _client->connect(this->domain, this->port, 2000); } else { result = _client->connect(this->ip, this->port, 2000); } or replace the 2000 with an constant in the header

cranton commented 5 years ago

As I just saw this only possible for the ESP32, and not for the ESP8266.